Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete...
Transcript of Intelligent Tutoring System using interaction logs for MOOCs...In a MOOC, each student’s complete...
Intelligent Tutoring System usinginteraction logs for MOOCs
MTech Project Report
Submitted by
Nilesh BirariRoll No 123059015
under the guidance of
Prof D B Phatak
Department of Computer Science and EngineeringIndian Institute of Technology Bombay
July 2015
i
ii
Acknowledgement
I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work
Nilesh BirariMTechDepartment(CSE)IIT Bombay
iii
Abstract
A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student
ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge
iv
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
i
ii
Acknowledgement
I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work
Nilesh BirariMTechDepartment(CSE)IIT Bombay
iii
Abstract
A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student
ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge
iv
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
ii
Acknowledgement
I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work
Nilesh BirariMTechDepartment(CSE)IIT Bombay
iii
Abstract
A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student
ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge
iv
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Acknowledgement
I would like to thank my guide Prof Deepak B Phatak for his consistent directionsand guidance throughout the project Because of his encouragement and giving us rightdirections we are able to do this project work efficiently and correctly I would also liketo thank Mr Nagesh Karmali for his continuous support throughout project work
Nilesh BirariMTechDepartment(CSE)IIT Bombay
iii
Abstract
A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student
ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge
iv
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Abstract
A massive open online course(MOOCs) is an online course aimed at unlimited partici-pation and open access via the web In a MOOC each studentrsquos complete interactionwith the course materials is recorded as a clickstream which produces vast amountof data understanding of student learning using this data enables more intelligentinteractive engaging and effective education The analysis of data using various datamining techniques give more insight into different student learning behaviour that canlead to take better decision for educators and researcher for future offering of the courseAnalysis of student learning behaviour can also be used for personalized guidance foreach group of student
ITS aims to provide personalised instruction for each student so the student model isthe key component of ITS which store the student information(student knowledge stateand behaviour analysis of student) Diagnosis of the student knowledge state is the mostdifficult process due to uncertain and imprecise information about the student To man-age uncertainty in data Bayesian network has the most sound mathematical foundationsand also a natural way of representing uncertainty using probabilities The proposedwork builds Bayesian student model for concept wise understanding of student state ofknowledge
iv
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Contents
1 Introduction 111 MOOCs 112 Challenges in MOOCs 113 Motivation 214 Intelligent Tutoring System for MOOCs 3
2 Literature Review 521 Review of Educational Data Mining 522 Adaptive and Intelligent Technologies 7
221 Adaptive Hypermedia 8222 ITS technologies 9223 Existing Systems 9
3 The Open edX frontend architecture 1131 Units(Weeks) sequences panels and modules 1132 Naming schemes 1233 Events 12
4 Open edx research data 1341 Student Info and Progress Data 1342 Course Content Data 14
421 Shared Fields 14422 Course Data 15423 Course Building Block Data 16424 Course Component Data 17
43 Tracking module id in Course Structure 18
5 Events in the Tracking Logs 2251 Common Fields 2252 Student Events 23
521 Navigational Events 23522 Video Interaction Events 24523 Problem Interaction Events 24524 Discussion Forums Events 25
53 New Findings in tracking logs 25
6 Tools and Technologies 2761 Primary requirements for development 2762 Python Bayes Network Toolbox 28
v
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
63 Jayes - Bayesian Networks for Java 2864 moocdb 29
7 Experiments With Data 3071 Pre-Processing and Data Cleaning 3072 Feature Extraction 3173 Clustering Experiments and results analysis 32
731 Experiment 1 32732 Experiment 2 33733 Experiment 3 35734 Experiment 4 36735 Experiment 5 36
8 Student modeling using Bayesian Network 3881 Student Modeling 38
811 Information in Student model 38812 Uncertainty-Based User Modeling 39
82 Bayesian Diagnostic Algorithm for Student Modeling 40821 Bayesian Network 40822 Bayesian Student Modeling 41
83 Experiments and Results 43831 Nodes and relations 44832 Parameter specification 44
9 Conclusion and Future work 46
vi
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
List of Figures
11 Work flow of MOOCs 212 Components of ITS 4
21 The cycle of applying data mining in educational systems 6
31 Courseware structure 11
61 Bayesian Network implementation steps [1] 28
81 Overlay model 3982 Bayesian Network model 4283 Analysis of student understanding for concept in CS1011x 45
vii
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
List of Tables
21 Existing Systems 10
71 Clustering Experiment 1 results 3372 Clustering Experiment 2 results 3473 Clustering Experiment 3 results 3574 Clustering Experiment 4 results 3675 Clustering Experiment 5 results 37
viii
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 1
Introduction
11 MOOCs
A massive open online course(MOOCs) is an online course aimed at unlimited participa-tion and open access via the web [2] MOOCs offers wide variety of courses to the largenumber of students There are multiple MOOC platforms (including the edX Courseraand Udacity) offering large number of courses some of which have had hundreds ofthousands of students enrolled Benefits of such on-line courses is they are classroomindependent and platform independentThe courses started at one places can be used bythousands of learners all over the world So with minimum use of the resources and withlittle efforts anybody can participating in such online courses these courses are offeredby expert faculties in course from top class institutes in the world That can provideexperience of learning from highly qualified faculties with rich learning material andresources
MOOCs are considered as a one of the best educational research vehicle for most ofthe researchers in education field as it ability to track each student detailed interactionwith the course materials participation in forum practice problems and assignmentsetc The large data sets created by students interactions in MOOCs provide goodopportunities for research in educational field
12 Challenges in MOOCs
Figure 11 shows the learning workflow in MOOCs When a new course start instructorputs learning material on topics As time passes during the subsequent assessment learnergoes through the available learning material learning material consist of Videos audiotext etc Going through learning material learner attempt for exercises assignmentsquizzes exams etc After assessment results evaluated are provided to the learner andsame procedure follows until the course finishes
The courses offered are nothing but the static traditional hypermedia of the variouslearning material For the diverse learner population this system will suffer from inabilityto be ldquoall things to all peoplerdquo [3] Each student registered in MOOCs has different char-acteristics such as knowledge level preferences interest goals cognitive style learningstyle prerequisite knowledge etc The courses offered by MOOCs could not able to meet
1
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Figure 11 Work flow of MOOCs
the need of each and every individual with different learning demands Due to which mostof the student not able to achieve satisfactory learning in course
13 Motivation
In traditional classroom based courses Due to limited number of students teachercan give special attention for each and every student for their classroom activitiesand can observe learning style knowledge level interests preferences etc In suchenvironment teacher can diagnose the student knowledge on various concept by askingquestions and teach on the basis of the diagnosis result for each student But dueto development in technologies and need to serve for the large number of populationwith courses offered on web with open access to all it is not feasible for MOOCinstructors to manually provide attention to individual with different backgroundsdiverse skill levels learning goals and preferences So there is an increasing need to makedirected efforts towards automatically providing better personalized content in e-learning
In a MOOCs each student complete interaction with the course materials is recordedas a clickstream which produces vast amount of data In [4] author states aboutimportant of such data that can be used for understanding of student learning andenable more intelligent interactive engaging and effective education Understandinghow students interact with MOOCs is a crucial issue because it affects how we design
2
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
future online courses Identifying student interaction patterns behaviour and sequenceof actions performed by student through clickstream analysis could enable more effectivepersonalized support for student information seeking and learning in MOOCs To do sorequires artificial intelligence and machine learning in the field of education
Artificial intelligence play an important role in learning and education Intelligenttutoring systems using technologies from artificial intelligent to advance learning andteaching In this work we focus on such data-driven development and optimization ofeducational technologies to build intelligent tutoring system for MOOCsThis work isalready carried out in the fields of educational data mining [5][6]and learning analytics[7]
14 Intelligent Tutoring System for MOOCs
Intelligent and Adaptive technologies proposes a solution to overcome limitations of thetraditional online courses ITS systems build a model of the knowledge and learningbehaviour for each individual learner and use this model throughout the interaction inorder to adapt to the needs of individual learner System offers personalised learningmaterial as per individual learning need
For the effective education and learning for MOOCs we will make our contributionto build the intelligent tutoring system which will address challenges in MOOCs listedabove First part is to address the challenge of finding behaviour of the student usingsequence of activity performed resources used and performance in the course similarbehaviour students are grouped into the cluster to provide personalised feedback foreach cluster having similar students And analysis of this data can also lead for betterunderstanding of student learning behaviour for making decisions for future offering ofthe courses
Second part is to estimate the concept wise knowledge of each individual studentparticipated in MOOCs The proposed solution built a model for each individual duringthe assessment and interaction with the system Using the student responses for theassessment system estimate performance of each student for concept to make predictionwhether student understand each concepttopic he learned Using this model systemanalyse individual student need and can provide personalised learning material as perindividuals demand
Majority of the ITS systems build has modularised in some components according tofunctionality Before going further we should familiar with these system and componentsin it As illustrated in figure 12 ITS contains following components[8] [9]
bull Domain Model
bull Student Model
bull Tutoring Model
bull Interface Model
3
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Figure 12 Components of ITS
Domain Model- Domain model also known as expert model it stores informationabout the material being taught This model contains concepts and relationship betweenconcepts solutions for the question lessons and tutorials and problem solving strategiesof the domain It stands as a backbone of the content of the system
Student Model- It represents domain dependent and independent characteristics ofeach user These characteristics are acquired from user explicit responses analysis fromthe interaction data through assessments etc It consider as a core component of theITS system
Tutoring Model- To make the choices about tutoring and action it accepts infor-mation from domain model and student model it guides learner with respective theircurrent state of knowledge to reach proficiency with the target skill
Interaction Model- The model concerned with the representation of the user(usermodel) and representation of the application(domain model) The main aim of the modelis capturing the appropriate raw data of the learner and representing the interface
4
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 2
Literature Review
Our work described here are falls into different categories listed in previous chapter Thereview done mostly includes related work we will presenting in next sections First wewill see some of the research related to data mining and learning analytics and that wewill be presenting in next subsequent sections Later in the sections we will present reviewon Adaptive and Intelligent Technologies in adaptive hypermedia and intelligent tutor-ing system that will require to understand the working of Adaptive and Intelligent system
21 Review of Educational Data Mining
WHAT IS EDMThe Educational Data Mining community website wwweducationaldataminingorg de-fines educational data mining as follows ldquoEducational Data Mining is an emerging dis-cipline concerned with developing methods for exploring the unique types of data thatcome from educational settings and using those methods to better understand studentsand the settings which they learn inrdquo
Data mining techniques can be used to discover useful information to assist educatorsto make decisions or modifications in environment or teaching approaches In [5] showsthat the data mining application in educational systems is an iterative cycle of hypothesisformation testing and refinement (see Fig 21) Mined Knowledge and results entersinto system and and guide facilitate and enhance the learning experience of the learnerby providing recommendation or personalised feedback This knowledge not only helpsto the learner but also to the educators for decision making so they can design planbuild and maintain the educational systems
In [10] the authors categorize work in educational data mining into
bull Statistics and visualization
bull Web mining
ndash Clustering classification and outlier detection
ndash Association rule mining and sequential pattern mining
ndash Text mining
5
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Figure 21 The cycle of applying data mining in educational systems[5]
In same work he has listed data mining with different kinds of education systemsincludes (see fig 21) Traditional classrooms and Distance education systems (Particularweb-based courses Well-known learning content management systems Adaptive andintelligent web-based educational systems)
A different viewpoint on educational data mining stated in [6][11] where the followingcategories are defined
bull Prediction
ndash Classification
ndash Regression
ndash Density estimation
bull Clustering
bull Relationship mining
ndash Association rule mining
ndash Correlation mining
ndash Sequential pattern mining
ndash Causal data mining
bull Distillation of data for human judgment
bull Discovery with models
6
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Besides the different ways of categorization Classification clustering Association rulemining and sequential pattern mining are the most used categories found in educationdata mining research The process of data mining in educational settings split into datacollection data preprocessing application of data mining and interpretation evaluationand deployment of the results [12]
There is lot of research on MOOCs from last couple of years specially in the field ofEducational data mining and learning analytics Most of the research is on behaviour ofthe student engagement in course dropout rate performance prediction and some otherresults
Our work is based on clustering of the similar behaviour students on the basis oftheir performance resource use forum participation and other interactions in the coursesimilar work on the basis of user activity sequence and their performance in the courseis described in [13] Authors work contains Activity sequence modeling and dynamicclustering on the problem solving sequence of actions to find student with similarbehaviour action sequence during problem solving
Activity data mining is commonly used technique in most of the researches in thefield of educational data mining In [12] [14] the authors describe a data mining processbased on aggregated log data The logged data contains every single click a user makesfor navigational purposes The features considered for the mining are the number ofassignments done by a student the number of quizzes failed the number of quizzespassed the total time spent on assignments etc The goal is the evaluation of theperformance of different classification algorithms for the prediction of students finalgradesIn [15] author aims at predicting how suitable a specific course is for a specificstudent via classification in order to provide personalized recommendations
Student Modelling Based on Clustering is a prime focus topic for the most of theresearchers in data mining student model is the key component in any system used toprovide personalised e-learning In [16] we can find a detailed about clustering approachto user modelling The data they used converted to the feature vector for each student andthen this feature vectors fed to the clustering phase each feature vector represents stu-dent all activities In [17] they used clustering approach based on collaboration behaviour
22 Adaptive and Intelligent Technologies
As we have seen the benefits of MOOCs and need of the ITS for MOOCs In this section wewill take a brief survey of the existing ITS and Adaptive Hypermedia Systems Along withwe will study what are the adaptive and intelligent technologies available and how theyare implemented on web Most of the adaptive and intelligent technologies are inheritedfrom Adaptive Hypermedia and Intelligent Tutoring System The study presented in thissection in not used for our work but we should be familiar with such existing systemsfor better understanding of ITS Adaptive hypermedia is the part of adaption requiredto present personalised content to every user ITS technologies presented are requires toenhance the learning experience of the user
7
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
221 Adaptive Hypermedia
In [3] author describes static hypermedia applications has some limitation that theyprovides same set of presentation and links to all user For example they provides samestatic explanation and same next page to students with widely different educational goalsand knowledge of the subject To overcome the limitations of static hypermedia adaptivehypermedia system builds model of knowledge goals and preferences of each individualstudent and use this through- out the the interaction with student in order to adapt tothe need of that student
AHAM described in [18] is one of the most popular reference model to supportadaptive hypermedia authoring AHA(Adaptive Hypermedia Architecture) system whichis authoring tool for the adaptive hypermedia application which is build on the samereference model Below are the functions of the model described by the author
Adaptive Hypermedia performs following three functions
bull When user interact with the system system captures interaction with the userBasedon this observation Hypermedia system maintains model of the user about eachdomain concept System store a knowledge value for each concept Knowledgevalue shows the information about how much user does knows about the conceptand also represent whether user read that concept or not
bull According to current knowledge interest or goal nodes or pages are classified intointo several groups using user model The system manipulated link anchors withinpages (and link destination) to guide user towards interesting relevant informationLink anchor could be specially annotated disabled or removed This method calledas adaptive navigation support
bull Content of the pages contains appropriate information ie expected level of dif-ficulty or details for each user according to user model When presenting systemconditionally show hide highlight or dim conditional fragments on a page Thisis done in order to ensure that page contains appropriate information for a user atgiven time This method called as adaptive presentation
2211 Adaptive Presentation
[18] Adaptive presentation of the information is nothing but the manipulation of the textfragments this manipulation can be
bull Providing prerequisite additional or comparative explanation Techniques used forsuch presentation are Conditional inclusion of fragments and Stretch text In Condi-tional inclusion of fragments user model and concept relation model(domain model)determines which fragment should be displayedIn Stretch text for each fragmentthere is a placeholder System determines which fragments to be rdquostretchedrdquo orrdquoshrunkrdquo (ie only the placeholder is shown)
bull Providing explanation variants The same information can be presented in differentways depends on level of difficulty value in user model
bull Reordering information Some user may prefer example before definition while otherprefer other way around all this reordering is done using the user model values
8
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
2212 Adaptive Navigation Support
[18] The manipulation [6] of links that are presented within nodes (pages) is typicallydone in one or more of the following waysDirect guidance System informs student that lsquonextrsquo button lead to the most appropriatepage in hyperspace according to the current knowledge levelSorting of links The presented list of links is sorted from most relevant to least relevantLink annotation Link anchors are presented differently depending on the relevance of thedestinationLink hiding Inappropriate or non-relevant information containing link are hidden frompresented pageLink disabling Inappropriate links are disabledLink removal Inappropriate links are simply removed
222 ITS technologies
In [19] author describes ITS asITS uses knowledge about domain student and teach-ing strategies to support individualised learning and tutoring Curriculum sequencingproblem solving support are the core technologies used by the most of the ITS
2221 Curriculum sequencing
Curriculum sequencing finds lsquooptimal pathrsquo through the learning material by providingstudent with the most suitable individually sequence of knowledge units to learn andsequence of learning tasks (examples questions problems etc)
2222 Problem solving support
Problem solving support is one of the most widely used technologies in most of the ITSsystems There are three different approaches are Intelligent analysis of student solutionsInteractive problem solving support Example-based problem solving support
Intelligent analysis of student solutions technology decides whether solution iscorrect or not find out what is wrong or incorrect in the solution and identifies whichmissing or incorrect knowledge may responsible for the incorrect solution
Interactive problem solving support provides student with intelligent help oneach step of problem solving System monitors student action understand them and usethis understanding to update student model and provide them appropriate help
Example-based problem solving technology help students to solve problem bysuggesting them relevant successful problem from their earlier experience
223 Existing Systems
In the field of adaptive and intelligent technologies for education there are many systemswere proposed from last two decades Most of the system build uses the techniques listedabove The table shows the some of the most popular systems and the technologies usedby those systems [19][20] [21]
Most of the existing adaptive and intelligent systems are aimed at specific applicationsie specific to some domain Such systems are closed and difficult to reuse for otherdomain application Few systems like AHA offers authoring tools to build the adaptivecourse content for each individual But with the study of MOOCs we can conclude that
9
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
System Description Technologies usedAHA open adaptive hypermedia
architecturePlatform foradaptive hypermedia
adaptive presentation adaptivenavigation support
ELM-ART Intelligent interactive sys-tem to learning programingin lisp
adaptive sequencing adaptivepresentation adaptive navigationsupportproblem solving support
InterBook Authoring and Deliver-ing adaptive electronictextbooks
adaptive sequencing adaptivepresentation adaptive navigationsupport
KBS Hyperbook textbook in hypertext doc-ument
adaptive sequencing adaptivenavigation support
NetCoach Authoring system that meetthe needs to create adaptivelearning course
curriculum sequencing adaptiveannotation of links
Metalink Authoring tools for adap-tive hyperbook
page sequencing adaptive presen-tation
SIETTE ITS for classification andidentification of differentvegetable species
Question sequencing
SQL-Tutor support learning SQL Intelligent solution analysis
Table 21 Existing Systems
there is requirement of reliable platform which can make teaching and learning easier Theplatform is nothing but the Learning Management System(LMS) that can handle servingtests grading assignments pushing course material providing user friendly interfacecommunication through chat rooms discussion forum and many other feature presentin current LMS So these features difficult to build and handle by the any stand aloneadaptive learning system Proposed solution for this problem is to integrate the adaptiveand intelligent technologies within existing LMS
10
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 3
The Open edX frontend architecture
31 Units(Weeks) sequences panels and modules
The Open edX course material is organised in three hierarchical levels (see fig 31 ) thatwe refer to as units sequences and panels[22]
Figure 31 Courseware structure
UnitContains all course material belonging to a particular weekSequenceIt is section in week groups course material by topicPanelEach sequence has a horizontal bar that contains the panels in sequence sequencecontains set of consecutive panelsModulePanel contains resources like video or problem These atomic components are referred toas modules
11
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
32 Naming schemes
In Open edX platforms URL patterns partially represents hierarchical structure of thecourse URLs do not contain information about the courseware modules that appear onpanels Modules are named using separate platform specific naming scheme
bull URLs Examples of course URLs corresponding to each level of the course hierarchy
ndash Unit(Week)-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x2T2014
courseware5773a6ab6319440aa5889ae38e578402
lsquo5773a6ab6319440aa5889ae38e578402rsquo represents the unique identifier of theUnit
ndash Sequence-level URLwwwcoursesedxorgcoursesIITBombayXCS1011x
2T2014courseware5773a6ab6319440aa5889ae38e578402
ba24809f572846458e1c78e09411863a
lsquoba24809f572846458e1c78e09411863arsquo unique identifier corresponds to partic-ular sequence lsquo5773a6ab6319440aa5889ae38e578402rsquo is a week identifier andwill be same for all URLs of a sequences those belongs to that unit
ndash Panel-level URLFor every panel the URL will be same as that represented by Sequence ofcorresponding panel There is no change in URL during navigation betweenany two panel in same sequence
bull Module URIs
Problem or question or any other courseware resource are referred as module inOpen edX Modules are identified by URIs using the lsquoi4xrsquo namespaceHere the examplei4xIITBombayXCS1011xvideof61de37103ab42f2b50f5f5e43489e70
These modules are displayed one course panel and panel can contain multiplemodules
33 Events
In events log we can distinguish the event between interaction event or navigational event
bull Interaction eventsAll the engagement actions captured on particular course module are named asinteraction events The captured actions are like playing video problem check etcThese actions do not change the location of the user ie URL The metadataassociated with interaction event contains information about the module the user isengaged That way interaction events give information about the URLs on whichinteraction happened and information about the module the user engaged with
bull Navigational eventsNavigation events captures the information about when the user is moving in thecourse hierarchy These events captures information about accessing new URL andswitching course panel while staying on the same page
12
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 4
Open edx research data
For our work we are using data of the courses offered by IIT Bombay on edx platformedx makes research data available for download from amazon S3 for partner running theircourses on edx The data available are delivered in data packets that contains event logand database snapshots for courses offered by the institute Event data file contains a dailylog of course events A separate file is available for courses running on edx Database Datafile contains views on database tables This file includes all of an organizationrsquos coursesdata available at the time of the export A new file is available every week representingthe database at that point in time [23] More details see[23]
41 Student Info and Progress Data
edX stores stateful data for students internally and available for developers and researchersin database export in data packets Database data contains Student Info and ProgressData Data for students is presented in User Data Courseware Progress Data CertificateData[23]
bull User DataUser data contains following tables store data gathered during site registration andcourse enrollment
ndash auth user
ndash auth userprofile
ndash student courseenrollment
ndash user api usercoursetag
ndash user id map
ndash student languageproficiency
For our work we requires the student enrolled for the course which is found instudent courseenrollment Table From this table we can find student id for thestudents who enrolled in course
bull Courseware Progress DataEverything(each module) in the courseware is stored courseware studentmodule ta-ble with its state or score for every student We will use this table to read studentprogress data
13
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
bull Certificate Data certificates generatedcertificate table tracks the state of certifi-cates and final grades for a course
42 Course Content Data
For each course the database files contains a JSON file To gain overview of the courseand investigate course state at a point researchers can use this file This section describesthe contents of the course structure file
bull Shared Fields
bull Course Data
bull Course Building Block Data
bull Course Component Data
421 Shared Fields
Category children metadataThe these fields are present for all of the objects in thecourse structure file
bull Course Structure category FieldIn the course structure JSON file category field identifies structural elements of acourse Each file contains single lsquocoursersquo category object and other objects fromeach of the categoriesFor each object in the file the category field contains following different categories
ndash chapter The sections defined in the course outline
ndash course The dates settings and other metadata defined for the course as awhole
ndash discussion The content-specific discussion components defined for the course
ndash html The HTML components defined for the course
ndash problem The problem components defined for the course
ndash sequential The subsections defined in the course outline
ndash vertical The units defined in the course outline
ndash video The video components defined for the course
bull Course Structure children FieldThe children field is an array in which each elements are the modules that a specificstructural element in the course contains
bull Course Structure metadata FieldThis field contains key-value pairs that describe the settings defined for the modules
14
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
422 Course Data
In category lsquocoursersquo object the children field contains list of all chapters defined in thatcourse The metadata fields contains the information about parameter defined for thecourse includes dates pages textbooks and advanced setting values
Example of the course object defined for CS1011x
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xchapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
tabs [
name Courseware
type courseware
name Course Info
type course_info
name Textbooks
15
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
type textbooks
name Discussion
type discussion
name Wiki
type wiki
name Progress
type progress
name CS1011x Course Team
type static_tab
url_slug 84b41401f15b4a83a44cda2eb8a689fd
]
423 Course Building Block Data
Course is organised by defining hierarchical sections subsections and units Internally theedX identifies these building blocks as lsquochapterrsquo lsquosequentialrsquo and lsquoverticalrsquo The childrenfield for each of these object contains list of object identifiers it contain
bull chapter(section) contains list of sequentials(subsections)
bull sequential(subsections) contains list of verticals(units)
bull vertical(unit) list all components that they contain
The metadata field contains information about set of parameters defined by courseteam for each of them
Sample from CS1011X course
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
16
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
424 Course Component Data
Course content are specified by adding components to the unit(vertical) Core or basiccomponent for any course are lsquodiscussionrsquo lsquohtmlrsquo lsquoproblemrsquo and lsquovideorsquoCourse Component Data Sample for CS1011X course
i4xIITBombayXCS1011xhtml072ddf0da3534ac89adf058b92ef0f0e
category html
children []
metadata
display_name Selection Sort
17
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
i4xIITBombayXCS1011xproblembf0c201fde0c40e380c975b6ed93a124
category problem
children []
metadata
display_name Q13
markdown ltbgtQ13 What will be the output produced by the following program segmentltbgtnltpre style=rsquofont-family Courier New Courier monospacersquo gtnltbgtnint xicount=0 nforamp40i=-1iamplt=3i++amp41amp123 n x=i n ifamp40amp40xxx + xx - xamp41 == 1amp41 n count++ namp125 ncoutampltampltcount nltbgtnltpregtnWrite the output value as your answern= 2 n[explanation] nSince the value of ltbgtxxx + xx - xltbgt evaluates to 1 only when i is -1 and 1 and both -1 and 1 are within the range of values of i considered in the for loop the variable rsquocountrsquo increments only twicen[explanation]
max_attempts 1
weight 20
i4xIITBombayXCS1011xdiscussion0f364659ab24464fa95bfe77c86a5aa5
category discussion
children []
metadata
discussion_category Week 2
discussion_id 6fd74752fae64748bfee05ea4f9ae6ca
discussion_target Structure of a Simple C++ Program
43 Tracking module id in Course Structure
To identify student interaction behaviour in courseware it important to have knowledgeabout the courseware resources and their hierarchy of the course materialWe have seen
18
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
the details about the course structure in previous section JSON file in the database filepackets delivered by edX contains courseware structure information Here we will seehow to identify modules id of various courseware resources like problem video discussionforum etc within courseware hierarchy
As we have seen in previous section about Course Building Block Data in CourseContent Data using this information we can identify module id in course structureHere we see the exampleFirst identify the category lsquocoursersquo in course structure (json) file
i4xIITBombayXCS1011xcourse2T2014
category course
children [
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
i4xIITBombayXCS1011xchapter69e79228b2ee49988900ab45c555eedb
i4xIITBombayXCS1011xchapter969bb3eebf8f4b2492275af0a6e5be08
i4xIITBombayXCS1011xchapter1fad372f8d384edfa807b723851f4444
i4xIITBombayXCS1011xchapterdb831bde971143418ea3ad392fe223f8
i4xIITBombayXCS1011xchaptera0bcbaa98fb343e5bd5f7d8a7ea1cd86
i4xIITBombayXCS1011xchapterbc8f4e5d7e394bec93b07c529639ca41
i4xIITBombayXCS1011xchapterdeb4368aa1ee4090ba459f75c3bc60db
i4xIITBombayXCS1011xchapter177f11e1592b4e42a01d217957b7a5a8
i4xIITBombayXCS1011xchapter52216db258c84bf5a4560768546ca8e1
i4xIITBombayXCS1011xrsechapter5334110e69e04480bddc3238a9beac64
i4xIITBombayXCS1011xchapter1e8dd3ac589a4096a42497db8cd48b74
]
metadata
course_image IITBx-CS101x-Course-Bannerjpg
days_early_for_beta 40
discussion_topics
General
id i4x-IITBombayX-CS101_1x-course-2014_T1
display_name Introduction to Computer Programming Part 1
end 2014-09-20T063000Z
graceperiod
mobile_available true
start 2014-07-29T063000Z
Child field represents the list of object identifiers of chapters(weeks quizzes and ex-ams for CS1011X) in course Find 32 characters object identifiers of the chapter usingOpen edX front end architecture information Then find the match for retrieved 32character chapter identifier in child list of the course object in course structure file filealso contains details about chapter identifier search for the selected chapter identifierfrom the child list the details(data) of particular object identifier will be found for
19
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
example suppose we have to find identifier for week 1 first find 32 characters identifierfrom URL information of week 1 it will match with the child list of the course ierdquoi4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402rdquo Searchfor it then get another instant of the string which contains the details about the Chap-ter(Week 1)
i4xIITBombayXCS1011xchapter5773a6ab6319440aa5889ae38e578402
category chapter
children [
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
i4xIITBombayXCS1011xsequentialf0c2821e384a4a85b44a07575129b755
i4xIITBombayXCS1011xsequentialeeea132794aa4e0481e94a069cab3e10
i4xIITBombayXCS1011xsequential06713e09244b462ca480e03ce88bb6a3
i4xIITBombayXCS1011xsequentialba24809f572846458e1c78e09411863a
i4xIITBombayXCS1011xsequential97395d60fa094ff8a17770fa89739dce
i4xIITBombayXCS1011xsequential5d5f32c8ab204f2a95c34b07bf276de5
i4xIITBombayXCS1011xsequential93c7eb88973146489f83cc1fd855a9f7
]
metadata
display_name Week 1 Procedures programs and computers
start 2014-07-29T063000Z
Do same procedure to find details of sequence identifier
i4xIITBombayXCS1011xsequential6adae97399e5443c9560a2bd02a10314
category sequential
children [
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
i4xIITBombayXCS1011xvertical7021ad808a4241edbf23d2c272758e71
i4xIITBombayXCS1011xvertical05d5ebdf1007427aaa31792a2ec262a1
]
metadata
display_name Session 1 Preamble
start 2014-07-29T063000Z
Now can find a unit(vertical) in sequence using number of vertical in sequence panel
i4xIITBombayXCS1011xvertical27e5e0c3292241b0a29b1f57ee60ad4b
category vertical
children [
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
]
metadata
display_name Preamble
20
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
And finally can find different object identifiers for different module in the panel Inabove example it contains the object identifier for video module located in panel 1 of firstsequence(topic) in week 1(chapter 1) in CS1011X course
video module is here
i4xIITBombayXCS1011xvideo641407821f3a44ee9530a3ea900e2e80
category video
children []
metadata
display_name Preamble
download_track true
download_video true
edx_video_id IITCS1012014-V005000
html5_sources [
httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaymp4
]
sub UaB1KqHWX-k
track httpss3amazonawscomCS101xCS101x_S001_Preamble_IIT_Bombaysrt
youtube_id_0_75
youtube_id_1_0 UaB1KqHWX-k
youtube_id_1_25
youtube_id_1_5
similarly we can find different modules located in courseware hierarchy and thesemodule object identifiers can be used to gain more insight on user behaviour on theirresources use and courseware interaction using the interaction log data and coursewarehierarchy knowledge
Data packets also contains data about discussion forum data and wiki data For moredetails see [23] In next cahpter we will see more on Events in Tracking Logs
21
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 5
Events in the Tracking Logs
In this chapter we will see the event in tracking logs that are important to our work formore details see [23] Events are emitted by server browser or mobile and are stored injson document Delivered data packets contains event data in log files
There are two major types of events includes student interaction events generated bystudent interaction and instructor interaction events initiated by course team membershere we will cover event required for our work from student interaction for more detailssee [23]
51 Common Fields
bull context FieldThe context field includes member fields that provide contextual information Typeof the field is dictionary It has following important member fieldscourse id -Identifies the course that generated the eventorg id -The organization that lists the coursepath -The URL that generated the eventuser id -Identifies the individual who is performing the action
bull event Field event field is of type dictionary contains members that identifiesspecifics of each triggered event the member fields are different for different eventfields More information about the event member fields are described for each eventlater in this section
bull event source Field Specifies the source of the interaction that triggered the eventlsquobrowserrsquo lsquomobilersquo lsquoserverrsquo lsquotaskrsquo are different values for the field
bull event type Field There are many types of event emitted in student interactionout off we will see required event types in detail in section
bull page Field Field contains the URL of the page the user was visiting when the eventemitted
bull session Field This 32-character value is a key that identifies the userrsquos session allbrowser events contains value for the field server events do not contain session field
22
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
bull time Field Gives the UTC time at which the event was emitted in lsquoYYYY-MM-DDThhmmssxxxxxxrsquo format
52 Student Events
This section lists the events that are logged for interaction of students with LMS
bull Enrollment Events
bull Navigational Events
bull Video Interaction Events
bull Textbook Interaction Events
bull Problem Interaction Events
bull Library Interaction Events
bull Discussion Forums Events
bull Open Response Assessment Events
bull Third-Party Content Events
bull Testing Events for Content Experiments
bull Student Cohort Events
bull Open Response Assessment Events (Deprecated)
The value in event source filed distinguish between the events that originates inbrowser and server Any additional member fields for event contains in common contextor event fields
Events belongs to Navigational Events Video Interaction Events Problem InteractionEvents Discussion Forums are the important events for our work
521 Navigational Events
This section includes descriptions of page close seq goto seq next seq prev events
bull page closeThe page close event is browser event and originates from within the JavaScriptLogger itself
bull seq goto seq next and seq prevThe browser emits these events when a user selects a navigational control to switchbetween panel on same sequence
ndash seq goto is emitted when a user jumps between panel in a sequence
ndash seq next is emitted when a user navigates to the next panel in a sequence
ndash seq prev is emitted when a user navigates to the previous panel in a sequence
event field contains information about edX module id for sequence and panel numberfor new and old
23
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
522 Video Interaction Events
Video Interaction contains following events The list represent original names present inevent type field for all events is followed by a newer name in name field for events thathave an event source of lsquomobilersquo all the events are emitted by browser or edX mobileApp
bull hide transcriptedxvideotranscripthidden
bull load videoedxvideoloaded
bull pause videoedxvideopaused
bull play videoedxvideoplayed
bull seek videoedxvideopositionchanged
bull show transcriptedxvideotranscriptshown
bull speed change video
bull stop videoedxvideostopped
Out of all these events we are capturing play video pause video seek videostop video etc Our purpose is record the time for which student watch the video wecapturing member field from event field contains video id currenttime and for seek videoit oldtime and newtime Location about the video is available in page field described incommon fields
To record students complete courseware interaction Textbook Interaction Events arealso play an important role but unfortunately these events are not recorded in our instanceof CS1011X course so we will see other trick to identify whether student read thetextbook or not
523 Problem Interaction Events
Following are the different problem interaction event types
bull problem check (Browser)
bull problem check (Server)
bull problem check fail
bull problem reset
bull problem rescore
bull problem rescore fail
bull problem save
bull problem show
bull reset problem
24
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
bull reset problem fail
bull show answer
bull save problem fail
bull save problem success
bull problem graded
Out of all these we are recording only Problem chek (server)show answershowanswer problem show is also one of the important event torecord when student visited the problem but it do not works as defined instead there isanother event lsquoproblem getrsquo emitted by server when user visit the problem lsquoproblem getrsquois not mentioned in Problem Interaction Events in research doc for edX
show answershowanswer event is recorded along with problem id field in event fieldto identify the problem problem check is important event to record Each problem cancontains multiple subproblems we are recording the correctness for each subproblemgrades and attempt number see research doc for more details
524 Discussion Forums Events
Contains following event types
bull edxforumcommentcreatedWhen a user creates a new thread the server emits an event
bull edxforumresponsecreatedWhen a user responds to a thread the server emits an event
bull edxforumsearchedWhen a user search to a thread the server emits an event
bull edxforumthreadcreatedWhen a user adds a comment to a response the server emits an event
MongoDB discussion forums database data also contains discussion forum data that isincluded in the weekly database data files
53 New Findings in tracking logs
edx platform emits large number of events all event types are not documented in edxresearch doc Most of these events emitted by the server during user interaction withcourse We need to record all these events to make concrete conclusion on interactiondata Here we will see what are the different types of event are useful and those are notdocumented in edx research doc
1 when user navigates in courseware from one sequence to another or from one weekto another week no browser event generates when user moves from one page toother browser emits page close event for old page that has closed but there is nosuch event like page open or page visited when new page opens These kind of
25
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
events are emitted by server but not with some fixed event type value field Theseevents are named by URL of the new page openedExampleevent_typecoursesIITBombayXCS1011x2T2014courseware
db831bde971143418ea3ad392fe223f89f85ddb58c414acb8497258d81b780f1
In above event user opens sequence with object identifierlsquo9f85ddb58c414acb8497258d81b780f1rsquo in chapter with identifierlsquodb831bde971143418ea3ad392fe223f8rsquoSo it is possible to record these kind of event to gain knowledge about usermovement in the courseware We records this event with lsquopage openrsquo along withlsquouser idrsquo rsquotimersquo and information about page opened which is present in event fielditself
2 Similarly there is no browser event about when user visit the forum There aremany different events are emitted by the server about the discussion forum Fordiscussion there already a events for create threadcreate comment create reply andsearch in the forum but there is no single event about visit to discussionServer event types for forum visit are different like above example Here are someexamples
bull event_typecoursesIITBombayXCS1011x2T2014
discussionforum6be6272165ce4faaa22c4beed2ab8320threads
53f7430d2b8b56be8700008f
This example represents user browsing the discussion forum which has objectidentifier represented by lsquo6be6272165ce4faaa22c4beed2ab8320rsquo using thisidentifier we can locate this forum in courseware hierarchy using coursestructure
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumi4x-IITBombayX-CS101_1x-course-2014_T1threads
53ea151e86d32909610013e3
This event indicates browsing in main discussion page on course page
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumfc937988e5094bd38c368e38510f57fbinline
Inline some discussion
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f759442b8b5605e50000b8reply
Reply to some thread
bull event_typecoursesIITBombayXCS1011x2T2014discussion
comments53f766ee2b8b561d050000a2update
Upvote to comment
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumsearch
search in discussion forum
bull event_typecoursesIITBombayXCS1011x2T2014discussion
threads53f6d42f2b8b56b08000163aflagAbuse
bull event_typecoursesIITBombayXCS1011x2T2014discussion
forumusers2220followed
26
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 6
Tools and Technologies
Our project aims to make contribution to development of Intelligent Tutoring System forMOOCs Development and analysis of the project is carried out with some of the tooland technologies available in open source Here we discuss a brief introduction of someof the tools and technologies used
61 Primary requirements for development
1 PythonPython is a high level programing language Python programs can be written inobject- oriented as well as functional style Due to large number of libraries it isvery convenient to use python for developmentIn our work for processing of the log data is done using python programing languageWork includes data cleaning feature extraction and clustering on the features
2 JavaSimilarly java is a high level programing language and large number of tools andlibraries available in java it is reasonable to use java programingDue to availability of tool for bayesian network in java we use java for developingStudent model using Bayesian Network
3 MySQLMySQL is a open source Database management system It is most widely useddatabase software used for most of the database applications We used MySQL forbackend for all the database application in our system
4 phpMyAdminphpMyAdmin is used to handle administration of entire MySQL database php-MyAdmin is a free software tool written in PHP For installing phpMyAdmin weneed to install web server (Such as LAMP(LinuxApacheMySQLPHP)) with thiswe can access the interface with web browserFeatures of phpMyadmin are
bull Create copy drop rename tables columns and indexes
bull Browse and drop database tables views etc
bull Import the text files into tables
bull export data to various formats csv xml pdf etc
27
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
bull it will support the InnoDB tables and foreign keys
62 Python Bayes Network Toolbox
A general purpose Bayesian Network Toolbox With this library it is possible to input aBayesian Network with probabilitiesconditional probabilities on each node to calculatethe marginal and conditional probabilities of queries on the network The tool is availableat httpsgithubcomthejinxterspbnt
The tool is erroneous Initially tried with this toolbox to implement Bayesian Networkin python but it do not estimate the Posterior probabilities exactly So shift to the similartool available in java
63 Jayes - Bayesian Networks for Java
Jayes [1] is a open-source pure-Java bayesian network library for Bayesian networks andthe inference in such networks we will see the short review how to useThe BayesNet is a container for BayesNodes which represent the random variables of theprobability distribution you are modelingA BayesNode has outcomes parents and a conditional probability table It is importantto set the probabilities last after setting the outcomes of the parent nodes The followingdiagram shows the workflow
Figure 61 Bayesian Network implementation steps [1]
for more deatils seehttpwwwcodetrailscomblogintroduction-bayesian-networks-jayes
Used this tool for implementing bayesian network for student module
28
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
64 moocdb
The MOOCdb project developed by the ALFA group at Massachusetts Institute ofTechnology this aims to address the limitation of MOOC data science by providing astandard data model to enable MOOC data science at larger scale Some fundamentalactivities can be identified in any online learning environment In online environmentlearner use resources submit work for assessment and collaborate in forums Based onthese general behaviors there should be a model to capture traces of online learningactivities moocdb model address this challenge and transfers moocs clickstream data toa relational database[24][25]
Advantages of moocdb
bull The benefits of standardization
bull Concise data storage
bull ccess Control Levels for Anonymized Data
bull Savings in effort
bull Crowd source potential
bull A unified description for external experts
bull Sharing and reproducing the results
The schema organizes the information in terms of 4 different behavioral interactionmodes as follows [25]
1 Observing
2 Submitting
3 Collaborating
4 Feedback
Most likely and used tables are in Observing and Submitting modes In observing modesstores data about student observation and browsing of resources in the course theresources includes wiki forums lecture videos homeworks book tutorials Submittingmode records student interactions with the assessment modules of the course
For log data processing and cleaning tried moocdb We translated the CS101x dataevent log data into relational database for analysis and experimentation using moocdbThis data is pretty useful for Analytics purpose but not usefull for our work moocdb donot properly maintain interaction information of each enrolled user in course with theirstudent id It uses auto generated userids for each user interaction so it is difficult toidentify the particular user in database Another issue was they separate observing andsubmitting modes so it is difficult to find out activity sequence of each user in databaseDue to disadvantage of using moocdb model we are not using the model for our project
29
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 7
Experiments With Data
71 Pre-Processing and Data Cleaning
Processing the record of every user interaction in entire course can gain more insightsinto learners behaviour But finding meaningful interpretation from clickstream data ischallenging task For deriving meaningful observation requires the raw interaction tracesto be transformed
Identification of important events in such huge data is important as specified beforewe are considering some of the important events from this row data to draw meaningfulinterpretation from clickstream data The events considered for our work are from videointeractions navigation events discussion and problem interaction events etc
Our work is based on modeling of the user interaction behaviour in week duringsolving quiz problem So need to identify the object identifier from course structure dataor from URL string for that week represents identifier for that week as seen before inedX front end architecture Along with 32 characters long identifier for week resources(chapter) we also need to identify the problem around which user was using theseresources from chapter quizzes in the course CS1011X are hide from the course afterthe due date So only place to find problem identifier and quiz page identifier is coursestructure file
Video interaction are captured for all the modules those are located within theresources of the current week using page information in video interaction events Similarfor the navigation events only those events are captured are belongs to current weekpages Discussions forum interactions are captured only for those events whose discussionmodule identifier are located in current week in courseware hierarchy For this requiresto precapture the list of discussion object identifiers from course structure those arelocated in current week Problem interactions are captured only for the problem whoseproblem id belongs to current week
Once all this done we process the data for all users those are using the resources ofspecified week and participating in quiz for the week the data recorded are in particularperiod of the course during the time course week started upto the due date of the quiz ofthe week All the interactions data processed for the users those participated in the weekresources or in quiz for the for listed events Different kinds of data is recorded based on
30
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
event type of the data
72 Feature Extraction
Data cleaning extract valuable data from clickstream row data to draw meaningful inter-pretation Even Though it is not directly possible to apply this data for interpretationprocess So first we need to identify the characteristics of the database and extract fea-tures out of it Extracted each feature represents one parameter of the student behaviourBy applying the data mining process on couple of parameters(features) it is possible togain valuable information from dataThe different Features we identified in data are as follows
1 Time spend for watching video on a page and time spend with watching single videoFor each video interaction event page and video module id is available in data
2 Time spend on each sequence pageAs we have seen we have recorded the page open and page close events with theseevents it is possible to find time spend on each page sequence
3 PDF used on pageTextbook event are not recorded in our instant of CS1011x offering so it is hardto find this information we track the navigation of user for this information butthis do not reflect much more precise information about book used
4 Total time spend for watching all videos in weekIt is aggregate of time calculated for each video calculated before
5 Total time for user active in weekIf the time between two consecutive interaction is less than 20-30 minutes then itassumed that user was active in time between these action are summed into totaltime
6 Total number of slides used in week resourcesIt is aggregate of slides used for every sequence
7 Total attempts for quiz and marks obtain in quiz for each attempt Attempt andmarks information are extracted from problem check event for that question
8 Time between two consecutive quiz attemptsTime difference between two attempts recorded for the question
9 Prior Knowledge using previous quizzes markThis information is possible to find using database data from student progress datain courseware studentmodule table for problems from previous quizzes problem idsare find using the course structure file for all quizzes prior to this week
10 Navigation from Quiz to Resources Quiz to Discussion Resources to DiscussionThese navigation are recorded from all student interaction for each student arrangedin ascending order of time using current and previous state of student in courseware
31
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Interaction logs available did not captured textbook interaction in course so it is notpossible to draw meaningful and precise conclusion about the student behaviour withavailable data for CS1011X course Observed shows that most of the student do notwatching the videos but participating in quizzes (might be doing some other exercise liketextbook from course or any external reference material reading offline) and obtains goodmarks in quiz
73 Clustering Experiments and results analysis
As we have seen in literature review about different techniques in data mining out ofclustering clustering is one of the most prominent technique used for student modeling ineducational data mining The data is collected in the form of feature vector in which eachvector represents data of one student with different features Clustering is performed togroup the similar behaviour student and to discover patterns in the student interactionbehavior Clustering works by grouping feature vectors by their similarity( based on theEuclidean distance between feature vectors)
There exist numerous clustering algorithm each has some advantage and disadvan-tages We chose k-means because it is simple and work perfectly with our dataset ieour aim is to minimize the euclidean distance between feature sets Other advantage ofk-mean is time complexity is linear in the number of feature vectors
In this section we will see the clustering experiment with various features and resultanalysis of the formed clusters Before performing clustering first standardised (prepro-cess) the data ie Scaling and normalisation of the feature vector The feature vectorused for experiment are with different units of measure so it is recommend to standardisethe data for better results k-mean uses the euclidean distance so if we do not stan-dardise the data it will put more weight on the features with large variance or large range
The analysis of the results give more insight into different student learning behaviourthat can lead to take better decision for educators and researcher for feature offering of thecourse Analysis of student learning behaviour can also be used for personalized guidancefor each group of student
731 Experiment 1
Clustering on the basis of resource utilisation time spend with success in quizCluster Features
bull Total time spend in week for watching videos
bull Total time spent in courseware in week
bull Number of slides used in resources from week If heshe doesnrsquot watch the videothen heshe might have used the slides(book) so it is good to consider along withvideo watch time
bull Marks obtained in quiz after using all these resources in the week
Number of Clusters 10
32
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Cluster Video watchtime
Total timespend
N0 of PDFused
Marks N0 ofstu-dents
C1 602625641e+03 145086923e+04 330769231e+00 164102564e+00 39C2 177948276e+02 130930172e+03 137931034e-01 411206897e+00 232C3 240779661e+02 765304237e+03 861016949e+00 673728814e+00 118C4 967165354e+01 711228346e+02 708661417e-02 105511811e+00 127C5 524980000e+03 368727250e+04 502500000e+00 582500000e+00 40C6 568029167e+03 140809583e+04 813541667e+00 631250000e+00 96C7 136237234e+03 113920851e+04 101063830e+00 645744681e+00 94C8 989570552e+01 991492843e+02 118609407e-01 677096115e+00 489C9 385051724e+02 806713793e+03 860344828e+00 370689655e+00 58C10 571765537e+03 118892655e+04 106779661e+00 636158192e+00 177
Table 71 Clustering Experiment 1 results
Result Analysis
Different cluster centroid represented in table 71 here we will see what each clustercentroid represents about student behaviourC1 represents 39 students used resources and spending time but could not obtain goodmarksC2 reprsents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course usedslides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marksC9 represents 58 students spend most of the time on slides achieved average marksC10 represents 177 students spend lot of time watching videos total time in course withgood marks
With this analysis it is possible to provide personalised learning for each group of userfor better learning This can also used to find how students learns and resources use
732 Experiment 2
Clustering on the basis of resource utilisation time spend and prior knowledge withsuccess in quizCluster Features Same as in Experiment 1 along with prior knowledgeNumber of Clusters 10
33
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
ClusterVideo watchtime
Total timespend
N0 of PDFused
Prior Marks N0ofstu-dents
C1 301564748e+02 286328058e+03 248201439e-01
575282631e-01
575899281e+00278
C2 417294118e+03 378787059e+04 420588235e+00 718137255e-01
614705882e+0034
C3 246416667e+02 101316667e+03 136363636e-01
804473304e-02
490909091e+00132
C4 311176000e+02 724076000e+03 838400000e+00 737333333e-01
662400000e+00125
C5 632172549e+03 155625490e+04 354901961e+00 464052288e-01
223529412e+0051
C6 475035088e+02 878600000e+03 871929825e+00 441311612e-01
382456140e+0057
C7 539508466e+03 120893968e+04 111111111e+00 690161250e-01
645502646e+00189
C8 134079412e+02 147884706e+03 123529412e-01
875490196e-01
690000000e+00340
C9 574762105e+03 155007053e+04 820000000e+00 701002506e-01
635789474e+0095
C10 149159763e+02 829520710e+02 130177515e-01
354888701e-01
152071006e+00169
Table 72 Clustering Experiment 2 results
34
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Cluster Marks in firstattempt
Marks in sec-ond attempt
Total timespend afterfirst attempt
No of visitsto forum afterfirst attempt
N0 ofstu-dents
C1 379249012e+00 488142292e+00 445652174e+01 177865613e-02 506C2 700000000e+00 700000000e+00 274910000e+04 110000000e+01 1C3 627525253 0 0 0 396C4 597948718e+00 700000000e+00 613717949e+01 333333333e-02 390C5 327272727e+00 554545455e+00 57848101e+01 126582278e-02 11C6 015 0 0 0 80C7 822784810e-01 296202532e+00 257848101e+01 126582278e-02 79C8 5 628571429 162671428571 228571429 7
Table 73 Clustering Experiment 3 results
Result Analysis
See table 72 for different cluster centroid results in which each cluster represents a groupof students with similar behaviourThis experiment also shows similar behaviour like experiment 1 except we have addedprior knowledge new feature If we compare prior knowledge of the students with thesuccess in the quiz it clearly reflects in the results that students are performing good ifthey have superior prior knowledge
733 Experiment 3
Clustering based on student behaviour between two attempts of the question using theirtime spend and forum visit between two attempts Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull Total time spend after the first attempt of the question
bull forum visited to find help after first attempt of quiz
Number of Clusters 8
Result Analysis
See table 73 for different cluster centroid resultsC1 represents 509 students attempted second attempt without spending time in course-ware and visits to discussion forumC2 represents 232 students did not utilize the resources and spend time even thoughachieved average marks in quizC3 represents 118 students used only slides in course and achieved good marksC4 represents 127 students did not utilize the resources and not spend time with poorperformanceC5 represents 40 students spend lot of time watching videos much more total time incourse used slides and achieved good marksC6 represents 96 students spend lot of time watching videos total time in course used
35
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Cluster Marks in firstattempt
Marks in sec-ond attempt
Time betweentwo attempts
N0 of stu-dents
C1 048407643 143949045 40991719745 157C2 414285714e+00 580952381e+00 202791381e+05 21C3 379607843 489411765 178796666667 510C4 596042216 7 293036675462 379C5 627525253 0 0 396C6 685714286e+00 700000000e+00 477100714e+05 7
Table 74 Clustering Experiment 4 results
slides and achieved good marksC7 represents 94 students utilize very less resources and spend less time even thoughachieved good marksC8 represents 489 students did not utilize the resources and did not spend time eventhough achieved good marks
734 Experiment 4
Clustering based on time spend between two attempts of the question to find user tryand error behaviour Cluster Features
bull marks in first attempt of the quiz
bull marks in second attempt of the quiz
bull time between two attempts
Number of Clusters 6
Result Analysis
See table 74 for different cluster centroid results Cluster C2 and C2 spend significantamount of time and there is good improvement in their results C1 represents studentsattempting try and error with very poor performance C3 and C4 represents they madesmall mistakes in first attempt and corrected in short time in second attempt C5 studentattempted only one attempt and achieved good result in first attempt only
735 Experiment 5
Clustering based time spend in courseware visits to the forum and marks in quiz finduser help seeking behaviour with their success in quiz Cluster Features
bull Time spend in courseware
bull Number of time visited to forum
bull marks in quiz
Number of Clusters 4
36
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Cluster Time spend incourseware
visits to fo-rum
marks in quiz N0 of stu-dents
C1 456288907e+03 502919099e-01 600917431e+00 1199C2 455756923e+04 172307692e+01 676923077e+00 13C3 225484416e+03 370129870e-01 974025974e-01 154C4 231444135e+04 478846154e+00 578846154e+00 104
Table 75 Clustering Experiment 5 results
Result Analysis
See table 75 for different cluster centroid results C2 and C4 spend significant amountof time in courseware frequent visists to forum and achieved good marks C3 performsvery poorly and they look at forum for help and not significant time with courseware C1
spend less time and rear visits to forum but achived good marks
37
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 8
Student modeling using BayesianNetwork
81 Student Modeling
The goal of an ITS is to adapt instruction as per individual knowledge level for eachstudent Student model is the key component that allows personalised instruction to beadapted to each student Student model contains all information about student currentstate of knowledge that allows ITS to provide the personalised tutoring An ITS collectdata from various sources for creating and maintaining student model Sources includeobserving student interaction quizzes and exams or from explicitly request from user[26]
811 Information in Student model
Student model contains important information about student such as domain knowledgepreferences interest goal tasks learning style background etc Content of the studentmodel can be divided into domain specific and domain independent information [27]
bull Domain Independent InformationContains learner goals interests background and experience individual traits ap-titudes and demographic information
bull Domain specific informationShows the status and degree of knowledge and skills student achieved in certaindomain It is organized as in the form of knowledge model that has many elementswhich student needs to learn User knowledge is the most commonly used feature inmost of the adaptive education systems for student modeling some of the modelsused to represent knowledge of the domain are vector model overlay model andfault model
ndash Vector model is the simplest form of knowledge representation The vectorconsists of concepts topics or subjects in domain Each element of the ofthe vector is a real number or integer represents degree which learner gainsknowledge about that concepts topics or subjects
ndash Overlay model- To represent an individual userrsquos knowledge as a subset ofthe domain model is the basic Idea of overlay knowledge modeling In domainmodel each element represents a concept subject or topic So the structure of
38
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
user model is constructed from the structure of domain model Each elementin user model has a specific value measuring the userrsquos knowledge about thatelement corresponding to each element in domain model This value is consid-ered as the mastery of domain element and represented in the form of binary(knows or ignores) qualitative (good average weak etc) or quantitative (theprobability of knowing or not a real value between 0 and 1 etc)[28]
Figure 81 Overlay model
Overlay modeling approach is based on domain models which is constructedas knowledge network The complexity of the model depends on the DomainModel structure and on the estimate of the student knowledge which is ac-quired through assessments and analysis of the studentrsquos interaction with thesystem Fig 81 represent simple overlay model in which number indicatesmastery of the concept on the scale of 10
ndash Fault model- Unlike vector model or overlay model fault model is used todescribe the lack of learnerrsquos knowledge The model contains learners errors orbugs Information from fault model can be used to deliver learning materialconcepts subjects or topics that users donrsquot know
812 Uncertainty-Based User Modeling
The state of knowledge is generated from student behavior during interaction with thesystem ie inferred from information available for each student Diagnosis is the processthat infers student knowledge state from the available student information Diagnosisis the most complicated process in an ITS Due to uncertain (we are not sure thatavailable information is absolutely true) andor imprecise information(the values are notcompletely defined) treatment of inferring knowledge state from such information isdificult process Suppose for example student fail to answer question so most probablystudent doesnrsquot know the concept is uncertain information and observation of studentreading about concept is imprecise observation to estimate student knowledgetheseissues are addresed in [26]
Artificial Intelligence has addressed this problem in various ways such as rule-basedsystems fuzzy logic Bayesian Networks etc Bayesian networks are the most powerfulapproach for uncertainty management in Artificial Intelligence This technique combinesthe rigorous probabilistic formalism with a graphical representation and ecient inferencemechanisms Due to sound mathematical foundations and natural way of representing
39
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
uncertainty using probabilities Bayesian networks (BNs) used by many system designerfor uncertainty management[26] [29][30]
82 Bayesian Diagnostic Algorithm for Student Mod-
eling
In this section we will see approaches to implement student model for more accuracy indiagnosis of student knowledge at every conceptual level The student model defined isan overlay student model that is the studentrsquos knowledge is considered as a subset ofthe expertrsquos knowledge
821 Bayesian Network
A Bayesian Network is directed acyclic graph in which nodes in graph represents vari-ables and arcs between nodes represents probabilistic dependency among variables Theparameters used to represent uncertainty are the conditional probabilities of node givenits parents if the variables of the network are Xi i = 1 N and parents of the node Xi
represented by Pa(Xi) then the the parameters of the network are the conditional prob-ability distributions P (XiPa (Xi)) i = 1 n for each variable given itrsquos parent(forthe nodes without parents the a prior distributions) This conditional probabilities de-scribes joint probability distribution of the network distribution for the entire networkas
P (X1 Xn) =nprod
i=1
P (XiPa (Xi))
To define Bayesian Network need to specify following things
bull The set of variables X1 X2 Xn
bull The set of relationship(arc) among those variables The arcs represent casual in-fluence among the variables and network form with these arc and variables shouldnot contain any cycle
bull For each Xi the conditional distribution given its parent isP (XiPa (Xi)) i = 1 n
Once a BN is created it can be used to make inferences about the values of thevariables in the network Bayesian propagation algorithms make such inferences usingavailable information using set of observations or evidences This process is sometimesreferred to as beliefs update After beliefs update a posterior probability distributionreflects the influence of evidence for each associated variables Inference are made fordiagnostic or predictive Diagnosis is for identifying most likely cause given a set of evi-dence Evidence in that context can be referred a symptoms or fault On the other handPrediction is made to identify the most likely event occurrence given a set of observations
In a BN every variable can be either a source of information (if its value is observed)or object of inference (given the set of values that other variables in the network have
40
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
taken) [15]
We are using Bayesian Network to define student model in which variables are usedto represent concepts problems and auxiliary node The link between variables representsrelationship between nodes After that conditional probabilities must be specified Onceeverything is setup Bayesian Network can be used to make inference about the values ofthe variables in the network Observations or evidences are used as a available informationto make the inferences using probability theory
822 Bayesian Student Modeling
Overlay modeling approach is build using Bayesian Network in which knowledge nodein the network corresponds to the concept in domain model In this section we will seethe nodes dened for the Bayesian student modeling and relationship between nodesSpecifying conditional probabilities is the most difficult task in defining Bayesian networkThe techniques used in [31] for specifying conditional probabilities simplify the process
Implementation of bayesian network for student modeling to manage uncertainty hasalready defined in [8][32][33] Our work is extension to work defined in [31] to manageuncertainty for estimation of student knowledge Existing model do not consider eachproblem with different level of difficulties So after the belief update estimated knowledgesimilar for on evidence of easy or hard problem Our approach adds one more level to theexisting model In our approach instead of direct relation between concept and evidencewe are introducing auxiliary node Conditional probabilities defined between auxiliarynode and evidence are depends on difficulty index of the problem The specification thethe network is defined below
8221 Nodes and relationship among nodes
bull To dene Bayesian student model the nodes considered are
1 Evidential nodes evidential nodes are the problems in quiz assignmentsexams etc are answered correctlyincorrectly which is represented by E Thesenodes are used to collect the information relevant to the studentrsquos knowledgestate
2 Knowledge nodes Which are defined at three different levels of granularitywhich consist of concepts topics and subject are represented by C T and Arespectively Different level of granularity provides detailed information aboutthe student knowledge level as required by the ITS This kind of structureenables system to understand exactly which part of the subject masterednotmastered by the student
3 Auxiliary nodes These nodes are used for modeling convenience only Forsimplifying the process of specifying conditional probabilities between conceptand evidence node Auxiliary nodes used are represented by B in network
bull Relationship among those nodes are
1 Aggregation relationship As mentioned above each knowledge node has aconditional probability distribution given its parent The aggregation is existonly between knowledge nodes in granularity hierarchy
41
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
2 Relationship among concept and auxiliary nodes It is just like aggre-gation relationship exist between concept and auxiliary nodes It representinfluence of the related concept weight on which the relation is defined
3 relation between auxiliary nodes and evidence Mastering not master-ing the node has causal influence in answering related questions correctly Inthis relationship auxiliary node represent mastering of the associated concepts
Figure 82 Bayesian Network model
The Bayesian Network dened is depicted in Figure 82 It is divided into two partswhich overlaps in concept nodes
bull Diagnosis process is performed by a part which contains concept nodes and testitems At this stage the set of concepts mastered by the student are inferred fromstudentrsquos answers to the test items
bull Evaluation process contains knowledge nodes In which from the probability of know-ing each concept(obtained in previous stage) the state of knowledge of each topicestimated And from state of knowledge of topics the knowledge of subject isestimated
8222 Parameter specification
Once the model has been dened need to specied required parameters It is one ofthe most difficult problem for using Bayesian Network Next we will see the proposedapproaches for the parameter specification
1 The prior probability of knowing the concept Prior probability of mastering theconcept can be specified from the information available about the particular studentif information is not available the uniform distribution is used Information consistof accessing related material on concepts time spend etc
2 Conditional probabilities of each knowledge node given its parents Conditionalprobabilities are computed on the basis of importance of each knowledge node in
42
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
aggregated knowledge node The importance of each knowledge node is decided byweight assigned to that node
bull Let Cij i = 1 nj be the set of related concept in topic Tj and wij rep-resent the importance of concept Ci in topic Tj i = 1 nj Then for eachS sube i = 1 nj required conditional probability distribution is given by
P(Tj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
bull Let Ti i = 1 s be the set of related topics in subject A and αi representthe importance of topic Ti in subject A Then for each S sube i = 1 srequired conditional probability distribution is given by
P(A(Ti = 1iisinS Ti = 0i isinS
))=
sumiisinS αisumSi=1 αi
Aggregation relationships are used to obtain information about how well studentmastered topic or subject By using this network it is possible to obtain estimationof subject proficiency or understanding of each topic and this information allowsITS to individualized tutoring for any topic where student lack
3 Conditional probabilities of auxiliary node given related concepts Conditional prob-abilities are computed on the basis of importance of each concept in aggregatedauxiliary node The importance of each concept is decided by weight of the conceptin associated test item with auxiliary nodeLet Cij i = 1 nj be the set of related concept in question associated with givenauxiliary node Bj and wij represent the importance of concept Ci in auxiliary nodeBj i = 1 nj Then for each S sube i = 1 nj required conditional probabilitydistribution is given by
P(Bj
(Cij = 1iisinS Cij = 0i isinS
))=
sumiisinS wijsumnj
i=1 wij
4 For relationships among auxiliary node and test items the parameters need tospecify are the conditional probability of correctly answering questions given relatedconcept and it is depends on difficulty level of the test item fixed set of conditionalprobabilities are defined for each level of difficulty
83 Experiments and Results
As described above we have implemented Bayesian network model for student modelingTopics and subject nodes represents the understanding at different levels For experi-mentation and for deciding the concept-wise understanding of student we do not needthis part of the model So implemented model ignores top two level of knowledge nodeand considers concept as a top level knowledge nodes
In next section we will see the results on implemented bayesian network model Inthis experiment for specification of bayesian network we use CS1011X course dataFrom the network we have created student model for each student participated Student
43
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
model builded on bayesian network reflect the understanding of the student concept wiseknowledge in the course from data collected from their responses to the final exam
831 Nodes and relations
For specification of network we are considering three different kinds of nodes includes con-cept nodes (Knowledge nodes) axillary nodes and evidence or problem nodes Conceptnode are created for the each concept defined by the instructor auxiliary nodes are asso-ciated with evidence so for each evidence node there will be one auxiliary node Evidencenodes are the problems defined by instructor for evidence There can be any other kindof evidence like time spend or resources utilised In this experiment we are consideringproblem from final exam as a evidence in network As described above there is relation-ship between concept and auxiliary nodes and axillary nodes and evidence nodesDifferent concepts identified in CS1011x for implementing student model for CS1011xcourse are as follows
bull Introduction
bull Representation of numbers and string
bull Types and names of variables
bull Assignment statement and arithmetic and logical execution
bull Iterative solutions
bull Functionsparameter passing and recursive functions
bull Arrays and matrices
bull Sorting
bull Searching
bull Character string
bull Pointer and pointer in function
832 Parameter specification
bull Prior probabilities for concept nodes as defined by instructor In our experiment wedefine 05 for each concept node
bull Conditional probability between concepts and auxiliary node depends on weightof the concepts in corresponding problem associated with that axillary node Theweight is defined by the instructor We have defined weights for all problems weconsidered for evidence
bull Conditional probability between auxiliary node and evidence node is depends ondifficulty level of the problem Difficulty of the problem estimated with responsescollected from students for that problem In our case we have considered threedifferent levels of difficulty
44
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
1234 students participated in final exam in course The exam has 30 questions coversmost of the concepts listed above In this section we will see the belief update with studentresponses for evidences Belief update in concept node represents the knowledge of thestudent for that concept Below we will see the different concepts and understanding ofstudent for those concepts using Bayesian network student model with evidence obtainedfrom their responses to the questions
0
200
400
600
800
1000
Iterative solutions
Functionsparameter passing and recursive functions
Arrays and matrices
Character stringPointer and pointer in function
Num
ber
of
students
Concepts
Analysis of students understanding of concepts in CS1011x
Marksgt9090gt=Marksgt7070gt=Marksgt50
Markslt=50
Figure 83 Analysis of student understanding for concept in CS1011x
Graph 83 represents the knowledge of students for concepts from course The valueof the concept reflect the student knowledge on the basis of his responses to problemshowever the value also depends on number of evidence for the concept and probabilityspecification of the network between concept to concept
Graph shows that most of the students well understood the easy concepts like Iterativesolutions Functionsparameter passing and recursive functions but unable to understanddifficult concept Sharp decrease in knowledge of concept pointer and pointer in functionsis due to availability of few evidences for concept In that case it is either correctnessor incorrectness of the evidence reflects the student only complete understanding or notunderstanding of concept
45
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
Chapter 9
Conclusion and Future work
In this thesis work intelligent tutoring system proposes a solution to overcome limita-tions of the traditional online courses using clickstream data captured during studentsinteraction with the system The proposed solution uses the moocs student interactiondata to gain more insight in student learning behaviour This can lead to take betterdecision for educators and researcher for feature offering of the course
Intelligent tutoring system uses technologies in from artificial intelligent to advancelearning and teaching Data mining techniques discovers useful information to assisteducators to make decisions or modifications in environment or teaching approachesIdentifying student interaction patterns behaviour and sequence of actions performed bystudent through clickstream analysis could enable more effective personalized supportfor student information seeking and learning in MOOCs Using the clustering techniquepossible to group similar behaviour student that can be used for personalized guidancefor each group of student
The proposed solution also builds a model for each individual during the assessmentand interaction with the system The model estimate concept wise knowledge of studentto make prediction whether student understand each concepttopic he learned Themodel can be used to provide personalised learning material as per individualrsquos demandfor each concept The analysis of students understanding for each concept can helpeducators and researcher to design future online course
Due to availability of large data sets of moocs clickstream there is lot of scopefor research in the field of education data mining The data captured for CS1011xdid not captured textbook event so this work could not draw better understandingabout student behaviour Using data set that covers all interactions in course it is possi-ble to make better model for student behaviour analysis using approach used in this work
In bayesian student modeling for estimating student concept wise understanding onecan use different sets of evidence available like time spend for concept or resources usedThe model implemented with binary values (knownot-know or correctincorrect ) for thedistribution so for more refine belief update(knowledge estimation) it is possible to takedistribution with set of values
46
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
References
[1] An introduction to bayesian networks with jayes 2015
[2] Wikipedia Massive open online course mdash wikipedia the free encyclopedia 2014[Online accessed 23-October-2014]
[3] Peter Brusilovsky Methods and techniques of adaptive hypermedia User modelingand user-adapted interaction 6(2-3)87ndash129 1996
[4] Kenneth R Koedinger Emma Brunskill Ryan SJd Baker Elizabeth A McLaugh-lin and John Stamper New potentials for data-driven intelligent tutoring systemdevelopment and optimization AI Magazine 34(3)27ndash41 2013
[5] Cristobal Romero and Sebastian Ventura Educational data mining A survey from1995 to 2005 Expert systems with applications 33(1)135ndash146 2007
[6] Ryan SJD Baker and Kalina Yacef The state of educational data mining in 2009A review and future visions JEDM-Journal of Educational Data Mining 1(1)3ndash172009
[7] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[8] Cory J Butz Shan Hua and R Brien Maguire A web-based bayesian intelligenttutoring system for computer programming Web Intelligence and Agent Systems4(1)77ndash97 2006
[9] Wikipedia Intelligent tutoring system mdash wikipedia the free encyclopedia 2014[Online accessed 6-September-2014]
[10] Cristobal Romero and Sebastian Ventura Educational data mining a review of thestate of the art Systems Man and Cybernetics Part C Applications and ReviewsIEEE Transactions on 40(6)601ndash618 2010
[11] RSJD Baker et al Data mining for education International encyclopedia of educa-tion 7112ndash118 2010
[12] Cristobal Romero Sebastian Ventura and Enrique Garcıa Data mining in coursemanagement systems Moodle case study and tutorial Computers amp Education51(1)368ndash384 2008
[13] Mirjam Kock and Alexandros Paramythis Activity sequence modelling and dynamicclustering for personalized e-learning User Modeling and User-Adapted Interaction21(1-2)51ndash97 2011
47
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
[14] Cristobal Romero Sebastian Ventura Pedro G Espejo and Cesar Hervas Datamining algorithms to classify students In Educational Data Mining 2008 2008
[15] Cesar Vialardi Javier Bravo Agapito Leila Shila Shafti and Alvaro Ortigosa Rec-ommendation in higher education using data mining techniques 2009
[16] Saleema Amershi and Cristina Conati Automatic recognition of learner groups inexploratory learning environments In Intelligent Tutoring Systems pages 463ndash472Springer 2006
[17] Antonio R Anaya and Jesus G Boticario Clustering learners according to theircollaboration In Computer Supported Cooperative Work in Design 2009 CSCWD2009 13th International Conference on pages 540ndash545 IEEE 2009
[18] Paul De Bra Peter Brusilovsky and Geert-Jan Houben Adaptive hypermedia fromsystems to framework ACM Computing Surveys (CSUR) 31(4es)12 1999
[19] Peter Brusilovsky Elmar Schwarz and Gerhard Weber Elm-art An intelligenttutoring system on world wide web In Intelligent tutoring systems pages 261ndash269Springer 1996
[20] Peter Brusilovsky and Christoph Peylo Adaptive and intelligent web-based ed-ucational systems International Journal of Artificial Intelligence in Education13(2)159ndash172 2003
[21] Peter Brusilovsky et al Adaptive and intelligent technologies for web-based eductionKI 13(4)19ndash25 1999
[22] George Siemens and Phil Long Penetrating the fog Analytics in learning andeducation EDUCAUSE review 46(5)30 2011
[23] edx research guide 2015
[24] Kalyan Veeramachaneni Franck Dernoncourt Colin Taylor Zachary Pardos andUna-May OrsquoReilly Moocdb Developing data standards for mooc data science InAIED 2013 Workshops Proceedings Volume page 17 Citeseer 2013
[25] Moocdb shared data model standard for the data emanating from moocs 2015
[26] Peter Brusilovsky and Eva Millan User models for adaptive hypermedia and adaptiveeducational systems In The adaptive web pages 3ndash53 Springer-Verlag 2007
[27] Constantino Martins Luız Faria Carlos Vaz De Carvalho and Eurico CarrapatosoUser modeling in adaptive hypermedia educational systems Educational Technologyamp Society 11(1)194ndash207 2008
[28] Loc Nguyen and Phung Do Learner model in adaptive learning World Academy ofScience Engineering and Technology 45395ndash400 2008
[29] Eva Millan Monica Trella Jose-Luis Perez-de-la Cruz and Ricardo Conejo Usingbayesian networks in computerized adaptive tests In Computers and Education inthe 21st Century pages 217ndash228 Springer 2000
48
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-
[30] Eva Millan Tomasz Loboda and Jose Luis Perez-de-la Cruz Bayesian networks forstudent model engineering Computers amp Education 55(4)1663ndash1683 2010
[31] Eva Millan and Jose Luis Perez-De-La-Cruz A bayesian diagnostic algorithm forstudent modeling and its evaluation User Modeling and User-Adapted Interaction12(2-3)281ndash330 2002
[32] Hugo Gamboa and Ana Fred Designing intelligent tutoring systems a bayesianapproach Enterprise Information Systems III Edited by J Filipe B Sharp and PMiranda Springer Verlag New York pages 146ndash152 2002
[33] Cristina Conati Abigail Gertner and Kurt Vanlehn Using bayesian networks tomanage uncertainty in student modeling User modeling and user-adapted interac-tion 12(4)371ndash417 2002
49
- Introduction
-
- MOOCs
- Challenges in MOOCs
- Motivation
- Intelligent Tutoring System for MOOCs
-
- Literature Review
-
- Review of Educational Data Mining
- Adaptive and Intelligent Technologies
-
- Adaptive Hypermedia
- ITS technologies
- Existing Systems
-
- The Open edX frontend architecture
-
- Units(Weeks) sequences panels and modules
- Naming schemes
- Events
-
- Open edx research data
-
- Student Info and Progress Data
- Course Content Data
-
- Shared Fields
- Course Data
- Course Building Block Data
- Course Component Data
-
- Tracking module_id in Course Structure
-
- Events in the Tracking Logs
-
- Common Fields
- Student Events
-
- Navigational Events
- Video Interaction Events
- Problem Interaction Events
- Discussion Forums Events
-
- New Findings in tracking logs
-
- Tools and Technologies
-
- Primary requirements for development
- Python Bayes Network Toolbox
- Jayes - Bayesian Networks for Java
- moocdb
-
- Experiments With Data
-
- Pre-Processing and Data Cleaning
- Feature Extraction
- Clustering Experiments and results analysis
-
- Experiment 1
- Experiment 2
- Experiment 3
- Experiment 4
- Experiment 5
-
- Student modeling using Bayesian Network
-
- Student Modeling
-
- Information in Student model
- Uncertainty-Based User Modeling
-
- Bayesian Diagnostic Algorithm for Student Modeling
-
- Bayesian Network
- Bayesian Student Modeling
-
- Experiments and Results
-
- Nodes and relations
- Parameter specification
-
- Conclusion and Future work
-