
Performance Analysis on IITBombayX Using Event Log Data

A Research & Development Report

Submitted in partial fulfillment of requirements for the degree of

Master of Technology

by

Sandeep Kale
Roll No : 13305R003

under the guidance of

Prof. Deepak B. Phatak

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay

1 May, 2016

Contents

1 Introduction
1.1 Introduction

2 IITBombayX Architecture
2.1 IITBombayX Components
2.2 Tracking Logs Data

3 Proposed Approach and Prototype
3.1 Proposed Method
3.1.1 System Architecture
3.2 Data Cleaning
3.3 Data Loading

4 Experimental Setup
4.1 Tools Used
4.2 Cluster Setup
4.3 Data Used

5 Analysis of Student Behavior
5.1 Number of attempts
5.2 Navigation While Problem Solving
5.3 Video watched in Other Than normal Speed
5.4 Videos played with Faster speed For CS101
5.5 Videos played with slower speed For CS101
5.6 Number of Times Video visited by same User more than Once For CS101
5.7 Video part played repeatedly
5.8 Video part skipped repeatedly

6 Conclusion and Acknowledgement
6.1 Conclusion
6.2 Acknowledgement

Appendices

A Analysis of Discussion Forum
A.1 Introduction
A.2 Objective
A.3 Background
A.4 Implementation Approach
A.5 Queries For Objectives
A.5.1 Objective 1: Discussion thread created
A.5.2 Objective 2: Discussion Comment written
A.5.3 Objective 3: Comments On Own Thread

B Queries To Create tables in Hive

C Script To Draw Replayed Video Part Graphs

D Script to draw Forwarded Video part Graph

Chapter 1

Introduction

1.1 Introduction:

Today, data analysis is an important aspect of almost every activity, whether online or offline. This is also the case for MOOCs offered on edX or IITBombayX. More and more data are generated daily from activities conducted on these platforms, and it is not always possible to capture these data in a relational database. IITBombayX stores most of its data as log files to capture the minute details of activity on the platform, emitting a large number of logs for the different events that occur [7].

To extract valuable information from these log records, we have to analyse them. The log records capture all the activity of students while viewing study material, taking exams, posting questions, participating in discussions, etc. In this project, we perform a generalised log analysis for all courses. The analysis can also be done for a particular course with minor changes to the code and queries.

Chapter 2 discusses the IITBombayX architecture, focusing mostly on event logs and an overview of the log fields. Chapter 3 discusses the proposed model and system design, and explains data cleaning and data loading for IITBombayX event logs in detail. Chapter 4 discusses the experimental setup for this project and the tools and data used. Chapter 5 discusses the analysis of student behavior and the queries used to extract the information. Appendix A discusses the event log analysis of the discussion forum, which gives insights into the thread creation and commenting behavior of participants in an FDP course to be offered in future on IITBombayX. Appendix B lists the table definitions of the Hive database. We discuss future work on this project in Chapter 6.


Chapter 2

IITBombayX Architecture

IITBombayX is a web-based platform for creating, delivering, and analyzing online courses. IITBombayX uses the architecture of Open edX. IITBombayX also supports Blended Learning; in the blended model, a separate authentication process is used as a wrapper around Open edX.

2.1 IITBombayX Components

• CMS (Content Management System): The course authoring tool. It is a Django application that uses MongoDB (NoSQL) for content management.

• LMS (Learning Management System): The part of IITBombayX that students interact with. It displays content and runs quizzes and interactive applications. Its subcomponents include the Wiki, the Discussion Forum, etc.

• Event Tracking: Whenever a student interacts with a course, every action is stored in logs, classified by event type. For example, whenever a student clicks on a video to play or pause it, these events are stored in the logs with enough information to analyze them. IITBombayX records every interaction with the system by emitting event logs, and these tracking logs are kept in permanent storage. The events are captured and stored as nested JSON objects in order to take advantage of schema-less data storage systems.

• Open edX Insights and Analytics: Insights is a development version of a Python, Mongo, and Django framework for creating simple, pluggable analytics based on streaming events. It does not analyze every event in the logs.

2.2 Tracking Logs Data

Tracking logs can be classified by the event type for which they are generated [6]. Events consist of fields that are common to all events, fields related to student activity, and fields related to course team activity.

These logs can be analyzed by looking at the events from which they are emitted. Some events and common fields are detailed below.

• Common Fields: Fields that are common to the schema definitions of all logs.

– Context: Contains the course id, org id, path (the URL that generated the event), and user id fields.

– Event: Provides the details of the event for which this log was created.

– Event Source: Identifies the application from which the event originated, such as a browser or a mobile device.

– Event Type: Identifies the type of event that was emitted (for example, problem check in Listing 3.1). An event type belongs either to student activity or to course team activity.

– Page: The URL of the page the user was visiting when the event was emitted.

– Time: Gives the UTC time at which the event was emitted.

– UserName: The username of the user who caused the event to be emitted.

• Student Events

– Enrollment Events: Activities such as activation and deactivation of an account.

– Navigational Events: Events such as page close, goto position, and jump to discussion.

– Video Interaction Events: Events such as hide transcript, load video, pause video, play video, seek video, show transcript, speed change video, stop video, etc.

– Textbook Interaction Events: Events for interaction with PDFs and other text material provided.

– Problem Interaction Events: Interactions with problems in quizzes and exams. Some typical events are problem check, problem graded, problem save, problem show, save problem success, show answer, etc.

– Discussion Forum Events: Generated when a comment is created, a response is given, or a new thread is created in the discussion forums.
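As a rough illustration of where the common fields listed above sit in one tracking-log record, the following Python sketch flattens them out of a parsed event. The field names follow the example record in Listing 3.1; the helper name `common_fields` and the shortened sample record are assumptions made for illustration, and this is not the project's Java parser.

```python
import json


def common_fields(event):
    """Return the common fields of one parsed tracking-log event as a flat dict."""
    # The course, organisation and user ids are nested under "context".
    context = event.get("context") or {}
    return {
        "course_id": context.get("course_id"),
        "org_id": context.get("org_id"),
        "user_id": context.get("user_id"),
        "event_source": event.get("event_source"),
        "event_type": event.get("event_type"),
        "page": event.get("page"),
        "time": event.get("time"),
        "username": event.get("username"),
    }


# Shortened, hypothetical record reusing values from the documented example.
sample_line = json.dumps({
    "context": {"course_id": "edx/AN101/2014_T1", "org_id": "edx", "user_id": 9999999},
    "event_source": "server",
    "event_type": "problem_check",
    "page": "x_module",
    "time": "2014-03-03T16:19:05.584523+00:00",
    "username": "AAAAAAAAAA",
})

fields = common_fields(json.loads(sample_line))
print(fields["course_id"], fields["event_type"])
```

Missing fields come back as `None` instead of raising, which keeps the later cleaning step free to decide which records to drop.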


Chapter 3

Proposed Approach and Prototype

In this chapter, we discuss the proposed approach to the problem. We then explain the system architecture and the various data modules. Finally, we give details about extracting useful data from the logs and loading it into a structured database.

3.1 Proposed Method

The goal is to clean the tracking log data and load it into a Hadoop-based distributed file system. This makes querying the large dataset feasible in a reasonable amount of time. Daily or weekly reports based on these queries can then be sent to the respective authority.

3.1.1 System Architecture

The architecture diagram for the model used in our prototype is shown in Figure 3.1.

Figure 3.1: System Architecture for Proposed Model

The diagram shows how the analysis is done. The steps are explained in detail below.

• Data: As explained in Section 2.1, data are present in various modules, such as the LMS, the CMS, and the tracking logs. Of these, the tracking logs are not structured, so they are preprocessed and cleaned.

• ETL: The tracking logs are cleaned and preprocessed by a Java program according to their event type.

• Storage: The cleaned data are then moved to HDFS, where they can be used for analytics.

3.2 Data Cleaning

Each log file holds the logs of one day, with no segregation by event type. We have written a parser that parses the JSON object of each event log; logs are processed one by one. After identifying and classifying an event log object, we extract the useful fields related to that event and store them in the Java object corresponding to that event category, such as enrollment events, video events, discussion events, etc. Logs that are not properly structured are ignored, as are logs recorded for an unknown user. Some events that are not documented in the edX tracking logs documentation are partially ignored, as they occur very rarely. The Java code for this parser is available on the Wiki [4]. An example JSON object for a problem type event is shown in Listing 3.1.

Listing 3.1: Example JSON object

{"agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
 "context": {"course_id": "edx/AN101/2014_T1",
             "module": {"display_name": "Multiple Choice Questions"},
             "org_id": "edx", "user_id": 9999999},
 "event": {
   "answers": {
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": "yellow",
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": ["choice_0", "choice_2"]},
   "attempts": 1,
   "correct_map": {
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1":
       {"correctness": "incorrect", "hint": "", "hintmode": null, "msg": "",
        "npoints": null, "queuestate": null},
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1":
       {"correctness": "correct", "hint": "", "hintmode": null, "msg": "",
        "npoints": null, "queuestate": null}},
   "grade": 2, "max_grade": 3,
   "problem_id": "i4x://edx/AN101/problem/a0effb954cca4759994f1ac9e9434bf4",
   "state": {"correct_map": {}, "done": null,
             "input_state": {
               "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {},
               "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": {}},
             "seed": 1, "student_answers": {}},
   "submission": {
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1":
       {"answer": "yellow", "correct": false, "input_type": "optioninput",
        "question": "What color is the open ocean on a sunny day?",
        "response_type": "optionresponse", "variant": ""},
     "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1":
       {"answer": ["a piano", "a guitar"], "correct": true, "input_type": "checkboxgroup",
        "question": "Which of the following are musical instruments?",
        "response_type": "choiceresponse", "variant": ""}},
   "success": "incorrect"},
 "event_source": "server", "event_type": "problem_check",
 "host": "precise64",
 "referer": "http:\/\/localhost:8001\/container\/i4x:\/\/edX\/DemoX\/vertical\/69dedd38233a46fc89e4d7b5e8da1bf4?action=new",
 "accept_language": "en-US,en;q=0.8", "ip": "NN.N.N.N", "page": "x_module",
 "time": "2014-03-03T16:19:05.584523+00:00", "username": "AAAAAAAAAA"}
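The cleaning rules described in this section (skip logs that are not properly structured, skip logs without a known user, classify the rest by event type) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Java parser: the mapping `CATEGORY_BY_EVENT`, the function `clean_line`, and the choice to drop every unmapped event type are hypothetical simplifications.

```python
import json

# Hypothetical mapping from raw event_type values to the event categories
# used in this report; only a few event names are listed for illustration.
CATEGORY_BY_EVENT = {
    "problem_check": "problem",
    "load_video": "video",
    "play_video": "video",
    "pause_video": "video",
    "seek_video": "video",
    "speed_change_video": "video",
    "page_close": "navigation",
}


def clean_line(line):
    """Parse one log line; return (category, record), or None if it is ignored."""
    try:
        event = json.loads(line)
    except ValueError:
        return None  # log is not properly structured JSON
    if not isinstance(event, dict):
        return None
    username = event.get("username")
    if not username:
        return None  # log recorded for an unknown user
    category = CATEGORY_BY_EVENT.get(event.get("event_type"))
    if category is None:
        return None  # event type not handled in this sketch
    context = event.get("context") or {}
    record = {
        "username": username,
        "course_id": context.get("course_id"),
        "event_name": event["event_type"],
        "time": event.get("time"),
    }
    return category, record


lines = [
    '{"username": "AAAAAAAAAA", "event_type": "problem_check", '
    '"context": {"course_id": "edx/AN101/2014_T1"}, '
    '"time": "2014-03-03T16:19:05.584523+00:00"}',
    '{"username": "", "event_type": "play_video"}',            # unknown user
    '{"username": "AAAAAAAAAA", "event_type": "play_video"',   # truncated JSON
]
cleaned = [result for result in map(clean_line, lines) if result is not None]
print(len(cleaned))
```

Returning `None` for every ignored line keeps the filtering decision in one place, so the loading step only ever sees records that passed all three checks.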

3.3 Data Loading

After cleaning the logs one by one and storing the useful values in the Java object corresponding to the event, we load the data into the Hadoop data store using Hive insert queries. For more information about the tables, refer to Listing B.1 in Appendix B. During data loading, we faced many problems while inserting into the Hive database. These were solved by gradually updating to newer versions of Hive. For future implementations of this type of project, we recommend carefully selecting data types that are supported by the Hive insert query builder, HivePreparedStatement.


Chapter 4

Experimental Setup

4.1 Tools Used

• Apache Hadoop Version 2.6 [2]

• Apache Hive Version 0.14.0, 0.1.2, 2.0.0 with gradual increment [6]

• Derby Version 1.6.0 [1]

• Apache Tez 0.8.3 [3]

• Eclipse

4.2 Cluster Setup

For experimental purposes, we set up a cluster of two nodes, each machine having an Intel i5 processor, 4GB RAM, and 800GB of storage. Hadoop is installed with one node as master and the other as slave. Hive is installed on top of Hadoop [5]. Derby is used as the metastore for Hive, running as a remote service. MySQL can also be used as the metastore. By default, Hive uses the MapReduce (MR) execution engine on Hadoop, which is slow in terms of execution. We replaced the default MR execution engine with Apache Tez [3] for faster query execution.

4.3 Data Used

The data used in the analysis were IITBombayX student log data in JSON format for the duration of the Autumn semester. We parse the JSON and extract the useful fields, which are then stored in Hive tables. We execute HiveQL queries on Hive to get the desired results. We ignored logs that are not formatted correctly or that were emitted when the user was not logged in.


Chapter 5

Analysis of Student Behavior

After parsing the data and loading it into the database, we formulated some questions that can be answered from the collected log data.

Some of these questions are listed below:

• How many students answered a question in the first attempt? What is the distribution of the number of attempts for students who answered correctly?

• What is the student activity on the platform while solving a problem? Do students visit the discussion forum, watch videos, or do something else on the IITBombayX platform?

• How many students watched a video at a speed other than the normal speed of 1x?

• Which videos are played at a faster speed most often?

• Which videos are played at a slower speed most often?

• How many times are videos visited more than once by the same user?

• Which videos have parts that are replayed most often across all users?

• Which videos have parts that are skipped most often across all users?

We faced problems in analyzing the data due to infrastructural limitations: on a cluster of two nodes, queries ran for a very long time and exhausted the available resources. This happens because there are almost 20 million event log records in the database. Even for a join on a small subset of around 1 million tuples, we could not get a result after running the query for more than 36 hours. Another reason could be a lack of expertise in tuning Hive.

We list the queries we have written to answer the above questions.

5.1 Number of attempts

To find the number of students who answered a question in the first attempt, and the distribution of attempts needed to answer questions correctly, we first found, for each question module in each course, the maximum number of attempts by each student. From these values, we then counted, for each course, how many times each number of attempts occurs, irrespective of user. Figure 5.1 shows the distribution of the number of attempts for each course. From the figure, we can infer that students who answered correctly in CS101 mostly did so in the first attempt. In contrast, in EE210.1X more students answered correctly in the second attempt. Listing 5.1 gives the HiveQL queries used to get these results.

[Figure 5.1 plot: "Number Of attempts by student to answer correctly". x-axis: Number of Attempts (0 to 20); y-axis: Number Of Student (log scale); series: BMWCS101_1x, BMWEE210_1x, CS101_1xA15, EE210_1xA15, EE210_2x, ME209xA15, WEE210_2x.]

Figure 5.1: Number of Attempts for Correct Answer

Listing 5.1: Number of attempts by student to answer question correctly

-- This query will give number of attempts done by user to
-- answer question correctly for given course.
CREATE TABLE tmpUserCorrectAnswerAttempts as
SELECT modulesysname, username, courseName, max(attempts) as attempts
FROM edx.usersession
WHERE success='correct'
GROUP BY coursename, modulesysname, username;

-- this query will give number of student who solved
-- answer in n attempts for given course
SELECT courseName, attempts, count(*) as cnt
FROM tmpUserCorrectAnswerAttempts
GROUP BY courseName, attempts;
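The two-step aggregation in Listing 5.1 can be checked on a handful of rows with a small Python sketch. It assumes in-memory tuples instead of the Hive table, and the function name `attempts_distribution` and the sample rows are hypothetical. Like the query, it takes the largest attempt number among a student's correct submissions for a module.

```python
from collections import Counter


def attempts_distribution(rows):
    """rows: iterable of (course, module, username, attempts, success)."""
    # Step 1: max attempts per (course, module, user), correct submissions only.
    max_attempts = {}
    for course, module, user, attempts, success in rows:
        if success != "correct":
            continue
        key = (course, module, user)
        max_attempts[key] = max(attempts, max_attempts.get(key, 0))
    # Step 2: per course, count how many (module, user) pairs ended at n attempts.
    return Counter((course, n) for (course, _module, _user), n in max_attempts.items())


# Hypothetical sample rows.
rows = [
    ("CS101", "m1", "alice", 1, "correct"),
    ("CS101", "m1", "bob", 1, "incorrect"),
    ("CS101", "m1", "bob", 2, "correct"),
    ("CS101", "m2", "alice", 1, "correct"),
    ("EE210", "m9", "carol", 2, "correct"),
    ("EE210", "m9", "carol", 3, "correct"),
]
dist = attempts_distribution(rows)
print(dist[("CS101", 1)], dist[("CS101", 2)], dist[("EE210", 3)])
```

In the sample, carol's two correct submissions collapse into a single entry at 3 attempts, which is the effect of `max(attempts)` in the first query.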

5.2 Navigation While Problem Solving

In any course, it is necessary to find out what helps students most in the learning process: videos, the discussion forum, or something else. Evaluation is the best source for understanding student learning. When a student is given a problem to solve, how does he or she try to find a solution? To answer this question, we examined the data to find out what students visit while solving a problem. This can be inferred from the activity of the student on the MOOC platform. Figure 5.2 gives the number of events that occur while solving a problem for 3 different courses. The majority of events occurring while solving a problem are video events. Courses events are the second largest category. The discussion forum is another important place where students try to find answers to the problem at hand. Since navigation events make up a considerable portion of the activity, we can infer that students also check other problems in the list while solving a problem. Figure 5.2 also shows that textbook interaction and the wiki are the least likely to be visited by students while solving a problem.

[Figure 5.2 plot: "Number of events between Problem solving". x-axis: Course Name (CS101.1x, EE210.1x, ME209xA15); y-axis: Number Of Events (log scale); series: courses, discussion, navigation, textbookInteraction, video, wiki.]

Figure 5.2: Events While Problem Solving

The queries in Listing 5.2 give the activity of students while solving a problem.

Listing 5.2: Query to find navigation to other resources while solving the problem

DROP TABLE IF EXISTS CS101_1X_EventBetweenProblems;
DROP TABLE IF EXISTS EE210_1X_EventBetweenProblems;
DROP TABLE IF EXISTS ME209_XA15_EventBetweenProblems;

-- Query to get all the events of Problem type for course CS101.1X
CREATE TABLE CS101_1X_ProblemUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`, `createDateTime`,
       `eventType`, `eventName`, `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType="problem" AND courseName LIKE 'CS101%';

-- Query to get time interval for each user on course between two
-- events of problem on same module.
-- We are taking longest possible interval
CREATE TABLE CS101_1X_ProblemSelfOnTimeMinMax stored AS orc AS
SELECT username, min(createDateTime) AS intervalStart,
       moduleSysName, max(createDatetime) AS intervalEnd
FROM CS101_1X_ProblemUser
GROUP BY username, moduleSysName;

-- Query to get all the events other than Problem type for course CS101.1X
CREATE TABLE CS101_1X_OtherEventsUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`,
       `createDateTime`, `eventType`, `eventName`,
       `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType!="problem" AND courseName LIKE 'CS101%';

-- Query to get all the events appearing between longest time interval
-- user spends on one question module. We further get the count of
-- events by grouping on event type of event.
CREATE TABLE CS101_1X_EventBetweenProblems stored AS orc AS
SELECT u2.eventType, count(*) AS cnt
FROM CS101_1X_ProblemSelfOnTimeMinMax u1 JOIN CS101_1X_OtherEventsUser u2
WHERE u1.username = u2.username
  AND u1.intervalStart < u2.createDatetime
  AND u2.createdatetime < u1.intervalEnd
GROUP BY u2.eventType;

--------------------------
-- Following queries are same as above for different courses
CREATE TABLE EE210_1X_ProblemUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`,
       `createDateTime`, `eventType`, `eventName`,
       `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType="problem" AND courseName LIKE 'EE210%';

CREATE TABLE EE210_1X_ProblemSelfOnTimeMinMax stored AS orc AS
SELECT username, min(createDateTime) AS intervalStart,
       moduleSysName, max(createDatetime) AS intervalEnd
FROM EE210_1X_ProblemUser
GROUP BY username, moduleSysName;

CREATE TABLE EE210_1X_OtherEventsUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`,
       `createDateTime`, `eventType`, `eventName`,
       `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType!="problem" AND courseName LIKE 'EE210%';

CREATE TABLE EE210_1X_EventBetweenProblems stored AS orc AS
SELECT u2.eventType, count(*) AS cnt
FROM EE210_1X_ProblemSelfOnTimeMinMax u1 JOIN EE210_1X_OtherEventsUser u2
WHERE u1.username = u2.username
  AND u1.intervalStart < u2.createDatetime
  AND u2.createdatetime < u1.intervalEnd
GROUP BY u2.eventType;

-----------------------

CREATE TABLE ME209_XA15_ProblemUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`,
       `createDateTime`, `eventType`, `eventName`,
       `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType="problem" AND courseName LIKE 'ME209%';

CREATE TABLE ME209_XA15_ProblemSelfOnTimeMinMax stored AS orc AS
SELECT username, min(createDateTime) AS intervalStart,
       moduleSysName, max(createDatetime) AS intervalEnd
FROM ME209_XA15_ProblemUser
GROUP BY username, moduleSysName;

CREATE TABLE ME209_XA15_OtherEventsUser stored AS orc AS
SELECT `sessionId`, `userName`, `courseName`,
       `createDateTime`, `eventType`, `eventName`,
       `moduleSysName`, `eventNo`
FROM `UserSessionOld`
WHERE eventType!="problem" AND courseName LIKE 'ME209%';

CREATE TABLE ME209_XA15_EventBetweenProblems stored AS orc AS
SELECT u2.eventType, count(*) AS cnt
FROM ME209_XA15_ProblemSelfOnTimeMinMax u1
JOIN ME209_XA15_OtherEventsUser u2
WHERE u1.username = u2.username
  AND u1.intervalStart < u2.createDatetime
  AND u2.createdatetime < u1.intervalEnd
GROUP BY u2.eventType;
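The interval logic of Listing 5.2 is easier to follow on a toy example. The Python sketch below mirrors its steps for one course: build the longest interval per user and question module from the problem events, then count the non-problem events of the same user that fall strictly inside an interval. The function name `events_between_problems`, the tuple layout, and the integer timestamps are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def events_between_problems(events):
    """events: iterable of (username, event_type, module, timestamp)."""
    events = list(events)
    # Longest interval per (user, module): first to last problem event.
    intervals = {}
    for user, etype, module, ts in events:
        if etype == "problem":
            lo, hi = intervals.get((user, module), (ts, ts))
            intervals[(user, module)] = (min(lo, ts), max(hi, ts))
    by_user = defaultdict(list)
    for (user, _module), span in intervals.items():
        by_user[user].append(span)
    # Count non-problem events strictly inside each interval of the same user.
    # As in the join, an event inside two intervals is counted twice.
    counts = Counter()
    for user, etype, _module, ts in events:
        if etype == "problem":
            continue
        for start, end in by_user[user]:
            if start < ts < end:
                counts[etype] += 1
    return counts


# Hypothetical sample events.
events = [
    ("alice", "problem", "q1", 10),
    ("alice", "video", "v1", 12),
    ("alice", "discussion", "d1", 15),
    ("alice", "problem", "q1", 20),
    ("alice", "video", "v1", 25),   # after the interval, not counted
    ("bob", "video", "v2", 12),     # bob has no problem events
]
counts = events_between_problems(events)
print(counts["video"], counts["discussion"])
```

Grouping the intervals by user first avoids comparing every event against every interval, which is the pairing that made the Hive join expensive on the full data set.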

5.3 Video watched in Other Than normal Speed

Videos can be watched at speeds of 0.5x, 0.75x, 1.0x, 1.25x, 1.5x, and 2x. We try to find out how often videos are played at a speed other than the normal speed. Figure 5.3 plots the number of times videos of each course are watched at a speed other than the normal 1x speed. Listing 5.3 gives an example query that produces the data needed to plot the graph in Figure 5.3.

Listing 5.3: Query to find videos watched at other than normal speed

DROP TABLE IF EXISTS videpSpeed_courseWise;
create table videpSpeed_courseWise stored as orc as
select courseName, currVideoSpeed, count(*) cnt
from userSessionold
where eventType='video' and eventName='speed_change_video'
  and currVideoSpeed != 1.0
group by courseName, currVideoSpeed
order by courseName, currVideoSpeed;

bin/hive -e 'select * from videpSpeed_courseWise' > /home/hduser/resultRnD/MisResult/videoSpeedCourseWise.txt

5.4 Videos played with Faster speed For CS101

Some videos are lengthy, and students try to skip through them or watch them at a faster speed. In this section, we plot the number of times a particular video is played at faster speeds for course CS101. We plot the top 10 videos that are played at faster speed, as shown in Figure 5.4. If a video is repeatedly watched at faster speed, we can analyze its content to check whether the video is too lengthy and needs to be cut. In the figure, each video is identified by its video ID, which will be replaced with the video name in future versions of the code.

Listing 5.4 gives an example query to find the data required to plot the graph in Figure 5.4.

[Figure 5.3 plot: "Number of time Video watched in Given Speed". x-axis: Course Name (CS101.1x, EE210.1x, EE210.2x, ME209xA15); y-axis: Number Of Events (0 to 2000); series: Speed 0.5, Speed 0.75, Speed 1.25, Speed 1.50, Speed 2.]

Figure 5.3: Number of times video watched other than given speed for Courses

Listing 5.4: Videos played with Faster speed For CS101

DROP TABLE IF EXISTS videoSpeed_Fast_courseWise_videoWise_CS101;

CREATE TABLE videoSpeed_Fast_courseWise_videoWise_CS101 stored as orc as
select moduleSysName, currVideoSpeed, count(*) as cnt
from userSessionold
where eventType='video' and eventName='speed_change_video'
  and currVideoSpeed > 1.0 and courseName like 'CS101%'
group by moduleSysName, currVideoSpeed
order by moduleSysName, currVideoSpeed, cnt;

bin/hive -e 'select moduleSysName, sum(cnt) as count1 from videoSpeed_Fast_courseWise_videoWise_CS101 group by moduleSysName order by count1 desc limit 10' > /home/hduser/resultRnD/MisResult/videoSpeedFastCourseWiseCS101.txt
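For readers who want to try the "top 10 videos by faster-speed events" computation without a Hive cluster, the same filter, count, and rank steps can be sketched in Python. The function name `top_fast_videos`, the tuple layout, and the sample events are hypothetical; only the event name and the speed threshold come from the query above.

```python
from collections import Counter


def top_fast_videos(events, course_prefix="CS101", limit=10):
    """events: iterable of (course, video_id, event_name, new_speed)."""
    counts = Counter()
    for course, video_id, event_name, new_speed in events:
        # Keep only speed changes to a speed above 1.0 in the chosen course.
        if (event_name == "speed_change_video"
                and course.startswith(course_prefix)
                and new_speed > 1.0):
            counts[video_id] += 1
    # Highest counts first, at most `limit` entries.
    return counts.most_common(limit)


# Hypothetical sample events.
events = [
    ("CS101.1x", "v1", "speed_change_video", 1.5),
    ("CS101.1x", "v1", "speed_change_video", 2.0),
    ("CS101.1x", "v2", "speed_change_video", 1.25),
    ("CS101.1x", "v2", "speed_change_video", 0.75),  # slower, ignored
    ("EE210.1x", "v3", "speed_change_video", 2.0),   # other course, ignored
    ("CS101.1x", "v1", "play_video", 1.0),           # not a speed change
]
top = top_fast_videos(events, limit=2)
print(top)
```

Changing the comparison to `new_speed < 1.0` gives the slower-speed variant of the same ranking.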

5.5 Videos played with slower speed For CS101

Some videos deliver content or explain concepts too quickly. In such cases, students may opt to watch the video at a slower pace. In this section, we plot the number of times a particular video is played at a slower speed for course CS101. We plot the top 10 videos that are played at slower speed, as shown in Figure 5.5. If a video is repeatedly watched at slow speed, we can analyze its content to check whether it delivers the content too fast. In the figure, each video is identified by its video ID, which will be replaced with the video name in a future version of the code.

[Figure 5.4 plot: "Number of time Video watched Fast Speed". x-axis: Video Module Name (top 10 video IDs); y-axis: Number of times watched in fast speed; series: Number Of Times.]

Figure 5.4: Number of times video watched in faster speed

Listing 5.5 gives an example query to find the data required to plot the graph in Figure 5.5.

Listing 5.5: Videos played with slower speed For CS101

DROP TABLE IF EXISTS videoSpeed_slow_courseWise_videoWise_CS101;

CREATE TABLE videoSpeed_slow_courseWise_videoWise_CS101 stored as orc as
select moduleSysName, currVideoSpeed, count(*) as cnt
from userSessionold
where eventType='video' and eventName='speed_change_video'
  and currVideoSpeed < 1.0 and courseName like 'CS101%'
group by moduleSysName, currVideoSpeed
order by moduleSysName, currVideoSpeed, cnt;

bin/hive -e 'select moduleSysName, sum(cnt) as count1 from videoSpeed_slow_courseWise_videoWise_CS101 group by moduleSysName order by count1 desc limit 10' > /home/hduser/resultRnD/MisResult/videoSpeedSlowCourseWiseCS101.txt

[Figure 5.5 plot: "Number of time Video watched Slow Speed". x-axis: Video Module Name (top 10 video IDs); y-axis: Number of times watched in Slow speed; series: Number Of Times.]

Figure 5.5: Number of times video watched in slower speed

5.6 Number of Times Video visited by same User more than Once For CS101

In this section, we plot the number of users who visited the same video more than once during their course on the IITBombayX platform. This does not include a video re-watched in the same session by seeking back to the start. Figure 5.6 plots, for CS101 course videos, the number of users who visited each video more than once.

Listing 5.6 gives the query to find the data needed to draw the graph in Figure 5.6.

Listing 5.6: Number of Times Video visited by same User more than Once For CS101

DROP TABLE IF EXISTS videoWatch_courseWise_userName_videoWise_count_CS101;
CREATE TABLE videoWatch_courseWise_userName_videoWise_count_CS101 stored as orc as
select userName, moduleSysName, count(*) as cnt
from userSessionOld
where eventType='video' and eventName='load_video'
  and courseName like 'CS101%'
group by userName, moduleSysName having cnt>1
order by username, moduleSysName;

# The query is wrapped in double quotes so that 'null' reaches Hive as a string literal.
bin/hive -e "select moduleSysName, count(*) as cnt from videoWatch_courseWise_userName_videoWise_count_CS101 where moduleSysname != 'null' group by moduleSysName order by cnt desc limit 10;" > /home/hduser/resultRnD/MisResult/VideoWatchedMoreThanOnceBySameUserCS101.txt
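The two counting steps behind Listing 5.6 (loads per user and video, then users per video with more than one load) can be sketched as follows. This is an in-memory Python illustration with a hypothetical function name `users_revisiting` and sample data; it assumes one input pair per load video event.

```python
from collections import Counter


def users_revisiting(load_events, limit=10):
    """load_events: iterable of (username, video_id), one per load video event."""
    # Step 1: number of loads per (user, video).
    loads = Counter(load_events)
    # Step 2: per video, number of users who loaded it more than once.
    revisits = Counter(video for (_user, video), n in loads.items() if n > 1)
    return revisits.most_common(limit)


# Hypothetical sample data.
load_events = [
    ("alice", "v1"), ("alice", "v1"),
    ("bob", "v1"), ("bob", "v1"), ("bob", "v1"),
    ("carol", "v1"),
    ("alice", "v2"), ("alice", "v2"),
]
top = users_revisiting(load_events)
print(top)
```

Note that the result counts users, not loads: bob's three loads of `v1` contribute one user to that video's total.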

[Figure 5.6 plot: "Number of time Video watched By Same User More Than Once". x-axis: Video Module Name (Top 10); y-axis: Number Of User; series: Number Of Times.]

Figure 5.6: Number of Times Video visited by same User more than Once

5.7 Video part played repeatedly

While watching videos on IITBombayX, a student has the option to seek backward or forward in the video. In this section we draw a graph for each video showing, in seconds from the start of the video, which parts of the video are watched repeatedly and how many times, as shown in Figure 5.7 for one randomly chosen video. Appendix C gives the script to draw this type of graph for all the videos. The queries in Listing 5.7 find the ranges of backward seeks for all the videos. A video is represented by its Module Id for the time being; this will be replaced by its name in a future version.

Listing 5.7: Query to find ranges in seconds of video seek backwards

DROP TABLE IF EXISTS videoSeek_replay_videpWise_CS101;
create table videoSeek_replay_videpWise_CS101 stored as orc as
select moduleSysname, min(currVideoTime) as low, max(oldVideoTime) as high, count(*) as cnt
from videoSeek_replay_videoWise
where courseName like 'CS101%'
group by moduleSysName order by cnt desc limit 100;

DROP TABLE IF EXISTS videoSeek_replay_videpWise_CS101_Ranges;

create table videoSeek_replay_videpWise_CS101_Ranges stored as orc as
select u1.moduleSysName, u1.currVideoTime, u1.oldVideoTime
from videoSeek_replay_videoWise u1 join videoSeek_replay_videpWise_CS101 u2
where courseName like 'CS101%' and u1.moduleSysName=u2.moduleSysName
order by u1.moduleSysName;

bin/hive -e 'select moduleSysname, cast(low as int) low, cast(high as int) high from videoSeek_replay_videpWise_CS101' > /home/hduser/resultRnD/CS101/Vid.txt

bin/hive -e 'select u1.moduleSysName, cast(u1.currVideoTime as int) curr, cast(u1.oldVideoTime as int) old from videoSeek_replay_videpWise_CS101_Ranges u1' > /home/hduser/resultRnD/CS101/VidRanges.txt
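The first query of Listing 5.7 can be sketched in Python as follows. This is a minimal illustration under one stated assumption: that the source table of replay seeks holds exactly those seek events whose new position (currVideoTime) is earlier than the old position (oldVideoTime). The function name `replay_ranges` and the sample events are hypothetical.

```python
def replay_ranges(seek_events):
    """seek_events: iterable of (module, old_time, curr_time), times in seconds.

    Returns {module: (low, high, cnt)} over backward seeks only, where
    low = min(curr_time), high = max(old_time), cnt = number of backward seeks.
    """
    ranges = {}
    for module, old_time, curr_time in seek_events:
        if curr_time >= old_time:
            # Forward seek (a skip); those are the subject of Section 5.8.
            continue
        low, high, cnt = ranges.get(module, (curr_time, old_time, 0))
        ranges[module] = (min(low, curr_time), max(high, old_time), cnt + 1)
    return ranges

# Hypothetical sample: two replays and one skip on v1, one replay on v2.
sample = [
    ("v1", 50.0, 20.0),
    ("v1", 90.0, 60.0),
    ("v1", 10.0, 40.0),
    ("v2", 30.0, 5.0),
]
print(replay_ranges(sample))
```

Swapping the comparison and the roles of the two time columns gives the corresponding forward-seek ranges.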

[Plot for Figure 5.7. Title: Video Replay for CS101x15_2e83efb1ad5c4ba39a075eca242f6d52. X-axis: Time from start of video in Second. Y-axis: Number of Times Video part Replayed.]

Figure 5.7: Number of times Video part replayed for a Video.

5.8 Video part skipped repeatedly

While watching videos on IITBombayX, a student has the option to seek backward or forward in the video. In this section we draw a graph for each video showing, in seconds from the start of the video, which parts of the video are skipped by forward seeks and how many times, as shown in Figure 5.8 for one randomly chosen video. Appendix D gives the script to draw this type of graph for all the videos. The queries in Listing 5.8 find the ranges of forward seeks for all the videos. A video is represented by its Module Id for the time being; this will be replaced by its name in a future version.

Listing 5.8: Query to find ranges in seconds of video seek forward


DROP TABLE IF EXISTS videoSeek_skip_videpWise_CS101;
create table videoSeek_skip_videpWise_CS101 stored as orc as
select moduleSysname, min(oldVideoTime) as low, max(currVideoTime) as high, count(*) as cnt
from videoSeek_skip_videoWise
where courseName like 'CS101%'
group by moduleSysName order by cnt desc limit 100;

DROP TABLE IF EXISTS videoSeek_skip_videpWise_CS101_Ranges;
create table videoSeek_skip_videpWise_CS101_Ranges stored as orc as
select u1.moduleSysName, u1.currVideoTime, u1.oldVideoTime
from videoSeek_skip_videoWise u1 join videoSeek_skip_videpWise_CS101 u2
where courseName like 'CS101%' and u1.moduleSysName=u2.moduleSysName
order by u1.moduleSysName;

bin/hive -e 'select moduleSysname, cast(low as int) low, cast(high as int) high from videoSeek_skip_videpWise_CS101' > /home/hduser/resultRnD/CS101_forward/Vid.txt

bin/hive -e 'select u1.moduleSysName, cast(u1.oldVideoTime as int) old, cast(u1.currVideoTime as int) curr from videoSeek_skip_videpWise_CS101_Ranges u1' > /home/hduser/resultRnD/CS101_forward/VidRanges.txt

[Plot for Figure 5.8. Title: Video Forward for CS101x15_b17a31311f9b4d22a6719fd9e670efe8. X-axis: Time from start of video in Sec. Y-axis: Number of Times Video part Forwarded.]

Figure 5.8: Number of times Video part forwarded for a Video.


Chapter 6

Conclusion and Acknowledgement

6.1 Conclusion

We successfully parsed the IITBombayX event tracking logs, which are in JSON format, and loaded the data into Hadoop. We also successfully executed some queries to get valuable information out of the logs. To conclude, a lot more needs to be done to make this system robust. I list below some items that can be taken up as future work; they will add functionality and increase the robustness of the system.

• Implement a distributed, multi-threaded program to parse the logs, since on a single system resources are exhausted and operations are time consuming. As a single-threaded system is already in place and each log is processed independently, this can be implemented without changing the core log-parsing system.

• Hive is not update friendly, for example when inserting a new tuple into a Hive table. The current system inserts one tuple at a time. We can add new tuples in batches to increase the speed of insertion.
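The batching idea in the last item can be sketched as follows. This is only an illustration under stated assumptions, not part of the current system: the helper names `batches` and `insert_statement` are hypothetical, the quoting of string values is deliberately simplistic, and the target Hive version is assumed to accept multi-row INSERT ... VALUES statements.

```python
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def insert_statement(table, chunk):
    """Build one multi-row INSERT statement for a chunk of tuples."""
    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        # Simplistic escaping, for illustration only.
        return "'" + str(value).replace("'", "\\'") + "'"

    values = ", ".join(
        "(" + ", ".join(literal(v) for v in row) + ")" for row in chunk
    )
    return "INSERT INTO TABLE {} VALUES {}".format(table, values)

# Hypothetical rows for the States table (id int, name String) of Appendix B.
rows = [(1, "Goa"), (2, "Kerala"), (3, "Punjab")]
for chunk in batches(rows, 2):
    print(insert_statement("States", chunk))
```

With a batch size of a few thousand tuples, the number of Hive statements issued drops by the same factor compared to one statement per tuple.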

6.2 Acknowledgement

This project is an extension of work done with Sukla Nag and Rahul Parashar. I am very thankful to Sukla Nag for the valuable inputs given while working on this project. Sukla Nag designed the data model for this project in MySQL and also identified the fields to be parsed. We worked with her in the initial phase of this project and later worked independently. Initially this project was planned with MySQL as the backend.

Appendices

Appendix A

Analysis of Discussion Forum

A.1 Introduction

IIT Bombay will be running an FDP course on IITBombayX for faculty members from different institutes in the near future. Some basic rules have been laid down by the Professor for a participant to be eligible for a certificate. Those relevant to the discussion in this document are listed below.

• Each participant must create at least n (say 3) questions/notes in the discussion forum for all other participants to respond to.

• Each participant must respond to at least m (say 5) questions/notes created by other members (comments on his/her own post will not be counted, as replying there is considered his/her implicit duty).

• Each participant must spend some time (how much will be decided by the Professor after the course starts) reading the discussion forum before replying to a thread. This is to ensure that participants do not reply without reading, just to satisfy the above two rules.

A.2 Objective

• Find the number of discussion threads created by each participant in this course.

• Find the number of replies given by each participant in the discussion forum.

• Find the number of replies given by a discussion creator to his/her own discussion (to ensure that he/she stays back to help other participants clear doubts, if any, about what he/she posted).

• Draw a graph to visualise the discussion pattern of all the participants from the above data.

A.3 Background

edX emits logs for all the events generated by user activities. These events are included in the daily event logs. All logs are in JSON format. The following events are relevant to the above problem:

• When a participant creates a new thread, such as a student asking a question, the server emits anedx.forum.thread.created event.

• When a participant responds to a thread, such as another student answering the question, theserver emits an edx.forum.response.created event.

• When a participant adds a comment to a response, such as a course team member adding a clarification to the student answer, the server emits an edx.forum.comment.created event.
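As an illustration of how these three events could be picked out of a daily log, the following minimal sketch counts them per user. It assumes one JSON object per line with top-level `username` and `event_type` fields; the function name `count_forum_events` and the sample lines are hypothetical, and this is not the parser used in this project.

```python
import json
from collections import defaultdict

# Maps the forum event types listed above to a short counter name.
FORUM_EVENTS = {
    "edx.forum.thread.created": "threads",
    "edx.forum.response.created": "responses",
    "edx.forum.comment.created": "comments",
}

def count_forum_events(lines):
    """Count forum creation events per username from JSON log lines."""
    counts = defaultdict(lambda: {"threads": 0, "responses": 0, "comments": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        kind = FORUM_EVENTS.get(record.get("event_type"))
        if kind is None:
            continue  # not a forum creation event
        counts[record.get("username")][kind] += 1
    return dict(counts)

# Hypothetical sample lines.
sample_lines = [
    '{"username": "asha", "event_type": "edx.forum.thread.created"}',
    '{"username": "asha", "event_type": "edx.forum.response.created"}',
    '{"username": "ravi", "event_type": "play_video"}',
    'not json',
    '',
]
print(count_forum_events(sample_lines))
```

Malformed lines are skipped rather than aborting the run, which matters when a daily log contains truncated records.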

Discussion event types are organized in a hierarchical relationship.

• A CommentThread represents the first level of interaction: a post that opens a new thread, often a student question of some sort.

• A Comment represents both the second and third levels of interaction: a response made directly to the conversation started by a CommentThread is a Comment. Any further contributions made to a specific response are also Comment objects.

For more specific details of the fields in the above event logs, see the EdX Research Guide. [7]

Table A.1: Hive Table: eventDiscussion

Field           Type           Remarks
course_id       String         Course Id
org_id          String         Organisation offering the course
user_id         Long Integer   User Id
moduleSysName   String         Module Title for resource
event_type      String         Event Type, "CommentThread" or "Comment"
session         String         Session Id
timeCreatedat   Timestamp      Time of event creation
username        String         Username of user
commentId       Integer        Comment Id for given comment
CommentType     String         created or response
Comment_type    String         question or discussion
timeUpdatedat   Timestamp      Last update to event
parent_id       Integer        Comment Id of parent comment
comment_count   Integer        Number of replies on given comment

A.4 Implementation Approach

We need to extract useful information from all event logs of type CommentThread and Comment, as mentioned above, and store the extracted data in a structured database. To store all the fields necessary for our analysis, we will create a table having at least the fields listed in Table A.1.

The next part is parsing these fields from the JSON objects for discussion events. We have to extend the existing parser to incorporate these fields. After parsing the logs, cleaning the data, and loading it into the database table, we execute the following sets of queries to find the results for the objectives mentioned above.

A.5 Queries For Objectives

The following sets of queries will help us get the desired analysis for the objectives. We assume that the queries are executed on Hive.

A.5.1 Objective 1: Discussion thread created

Listing A.1: Queries for Objective 1

-- Create a temporary table in Hive:
CREATE TABLE userCommentThreadCount (user_id int, count int);

INSERT INTO TABLE userCommentThreadCount
SELECT user_id, count(*) FROM eventDiscussion
WHERE commentType='CommentThread' AND course_id='<YourCourseName>'
GROUP BY user_id;

-- This will give users who created at least one thread and
-- the count of how many threads they created individually.

-- By doing a left outer join of the User database table (say User table)
-- with userCommentThreadCount, you will get the final count for
-- each user.

SELECT u.user_id, t.count FROM
User u LEFT OUTER JOIN userCommentThreadCount t ON
u.user_id=t.user_id;

-- From above, you can find out users who completed the basic
-- requirement of participation of creating a specific number of threads.
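The effect of the left outer join, followed by the eligibility check it enables, can be sketched in Python as follows. This is an illustration only: the function name `thread_eligibility` and the sample data are hypothetical, and the threshold of 3 threads is the example value "n" from Section A.1, not a fixed rule.

```python
def thread_eligibility(all_user_ids, thread_counts, minimum=3):
    """Return {user_id: (threads_created, meets_minimum)} for every user.

    all_user_ids  : every registered participant (the left side of the join).
    thread_counts : {user_id: count} for users with at least one thread.
    Users absent from thread_counts get 0, where the SQL join yields NULL.
    """
    report = {}
    for user_id in all_user_ids:
        created = thread_counts.get(user_id, 0)
        report[user_id] = (created, created >= minimum)
    return report

# Hypothetical sample: user 12 never created a thread.
users = [10, 11, 12]
counts = {10: 4, 11: 1}
print(thread_eligibility(users, counts))
```

The join matters because participants who never posted do not appear in userCommentThreadCount at all, yet they must still show up in the final report.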

A.5.2 Objective 2: Discussion Comment written

Listing A.2: Queries for Objective 2

-- Create a temporary table in Hive:
CREATE TABLE userCommentCount (user_id int, count int);

INSERT INTO TABLE userCommentCount
SELECT user_id, count(*) FROM eventDiscussion
WHERE commentType='Comment' AND course_id='<YourCourseName>'
GROUP BY user_id;

-- This will give users who commented on at least one thread
-- and the count of how many comments they have written individually.

-- By doing a left outer join of the User database table (say User table)
-- with userCommentCount, you will get the final count for
-- each user.

SELECT u.user_id, t.count FROM
User u LEFT OUTER JOIN userCommentCount t
ON u.user_id=t.user_id;

-- From above, you can find out users who completed the basic
-- requirement of participation of writing a minimum number of comments.

A.5.3 Objective 3: Comments On Own Thread

Listing A.3: Queries for Objective 3

CREATE VIEW threadCommented AS
SELECT user_id, commentid, parent_id FROM eventDiscussion
WHERE course_id='<YourCourseName>';

SELECT t1.user_id, count(*)
FROM threadCommented t1 JOIN threadCommented t2
ON t1.user_id = t2.user_id AND t1.commentid = t2.parent_id
GROUP BY t1.user_id;

-- The above query will give you the count of replies
-- given by a user to his own comment thread.
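The self-join above can be sketched in Python as follows, together with the complementary count of replies to other participants that the second rule of Section A.1 asks for. This is a minimal illustration with hypothetical names (`reply_counts`, the sample posts); like the query, it compares a reply only with the author of its immediate parent.

```python
from collections import Counter

def reply_counts(posts):
    """posts: iterable of (comment_id, user_id, parent_id).

    parent_id is None for a post that opens a thread.
    Returns (own, others): replies to one's own post, replies to other users' posts.
    """
    posts = list(posts)
    author_of = {comment_id: user_id for comment_id, user_id, _ in posts}
    own, others = Counter(), Counter()
    for comment_id, user_id, parent_id in posts:
        if parent_id is None or parent_id not in author_of:
            continue  # a thread opener, or a reply whose parent is unknown
        if author_of[parent_id] == user_id:
            own[user_id] += 1
        else:
            others[user_id] += 1
    return own, others

# Hypothetical sample: a opens post 1, b opens post 5.
sample_posts = [
    (1, "a", None),
    (2, "b", 1),   # b replies to a
    (3, "a", 1),   # a replies to own post
    (4, "a", 2),   # a replies to b's reply
    (5, "b", None),
    (6, "b", 5),   # b replies to own post
]
own, others = reply_counts(sample_posts)
print(dict(own), dict(others))
```

Keeping the two counters separate lets the same pass serve both the "stays back to help" objective and the rule that replies to one's own post do not count.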


Appendix B

Queries To Create tables in Hive

Listing B.1: Queries To Create tables in Hive

CREATE TABLE Cities (id int, name String, stateId int) STORED AS ORC;

CREATE TABLE Course (courseId int, lmsName String, orgName String,
    courseName String, courseTitle String, authorUserId int,
    currConcepts String, prevConcepts String, courseLang String,
    minPrice int, suggestedPrice int, currencyCode String,
    endDate timestamp, startDate timestamp) STORED AS ORC;

CREATE TABLE CourseCategory (courseCatgid int, categoryName String,
    courseCounts int, parentId int) STORED AS ORC;

CREATE TABLE CourseChapter (chapterId int, lmsName String, orgName String,
    courseName String, chapteTitle String, chapterSysName String,
    chapterStartDate timestamp, position int) STORED AS ORC;

CREATE TABLE CourseChapterSession (sessionId int, lmsName String,
    orgName String, courseName String, chapterSysName String,
    sessionSysName String, sessionTitle String, sessionStartDate date,
    position int) STORED AS ORC;

CREATE TABLE CourseDiscussions (discussionId int, lmsName String,
    orgName String, courseName String, chapterSysName String,
    discussionTitle String, discussionSysName String,
    discussionSysId String) STORED AS ORC;

CREATE TABLE CourseFiles (fileId int, lmsName String, orgName String,
    courseName String, chapterSysName String, sessionSysName String,
    fileTitle String, fileSysName String) STORED AS ORC;

CREATE TABLE CourseForums (forumId bigint, lmsName String, orgName String,
    courseName String, courseRun String, commentSysId String,
    commentType String, anonymousMode String, lmsAuthorId bigint,
    lmsAuthorName String, createDateTime timestamp,
    lastModDateTime timestamp, upVoteCount String, totVoteCount String,
    commentCount int, threadType String, title String,
    commentableSysId String, endorsed String, closed boolean,
    visible boolean) STORED AS ORC;

CREATE TABLE CourseOthers (otherId int, lmsName String, orgName String,
    courseName String, title String, htmlSysName String, type String,
    verticalSysName String, chapterSysName String, sessionSysName String,
    fileName String) STORED AS ORC;

CREATE TABLE CourseProblems (problemId bigint, lmsName String,
    orgName String, courseName String, chapterSysName String,
    sessionSysName String, quizSysName String, quizTitle String,
    quizType String, quizWeight float, noOfAttemptsAllowed int,
    quizMaxMarks float, hintAvailable int, correctChoice int,
    hintMode String) STORED AS ORC;

CREATE TABLE CourseRun (courseRunid int, lmsName String, orgName String,
    courseName int, courseRun String, willbeGraded String,
    gradePass float, actualPrice int, currencyCode String,
    startDate date, endDate date) STORED AS ORC;

CREATE TABLE CourseVersion (courseVersionid int, courseId int,
    Description String, CreatedOn date, LastModified date,
    filePathName String) STORED AS ORC;

CREATE TABLE CourseVertical (vertId bigint, lmsName String, orgName String,
    courseName String, sessionSysName String, verticalSysName String)
    STORED AS ORC;

CREATE TABLE CourseVideos (videoId int, lmsName String, orgName String,
    courseName String, chapterSysName String, videoSysName String,
    videoUTubeId String, videoDownload int, videoTrackDownLoad int,
    videoTitle String, videoUTubeId075 String, videoUTubeId125 String,
    videoUTubeId15 String, videolength float) STORED AS ORC;

CREATE TABLE CourseWiki (wikiId int, lmsName String, orgName String,
    courseName String, wikiSlug String, lmsWikiId bigint,
    createdDate timestamp, lastModDate timestamp, lastRevId int,
    ownerId bigint, groupId bigint, groupRead int, groupWrite int,
    otherRead int, otherWrite int) STORED AS ORC;

CREATE TABLE EventCourseInteract (eventId bigint, lmsName String,
    orgName String, courseName String, courseRun String, lmsUserId bigint,
    eventName String, eventNo int, moduleType String,
    moduleSysName String, moduleTitle String, chapterSysName String,
    chapterTitle String, createDateTime timestamp, modDateTime timestamp,
    oldPosition int, curPosition int, source String) STORED AS ORC;

CREATE TABLE EventDiscussion (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String) STORED AS ORC;

CREATE TABLE EventEnrollment (ventId bigint, lmsName String,
    orgName String, courseName String, eventName String, lmsUserId bigint,
    userName String, gender String, eduLevel String, activate String,
    adminUserId bigint, dateTime timestamp) STORED AS ORC;

CREATE TABLE EventForumInteract (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String,
    commentThreadId String, lmsUserId bigint, queryText String,
    noOfResults int) STORED AS ORC;

CREATE TABLE EventInstructor (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String) STORED AS ORC;

CREATE TABLE EventNavigation (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String) STORED AS ORC;

CREATE TABLE EventPDFInteract (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String) STORED AS ORC;

CREATE TABLE EventProbInteract (eventId bigint, lmsName String,
    orgName String, courseName String, lmsUserId bigint, eventName String,
    eventNo int, quizzSysName String, quizzTitle String,
    chapterSysName String, chapterTitle String, hintAvailable String,
    hintMode String, inputType String, responseType String,
    variantId String, oldScore Double, newScore Double, maxGrade Double,
    attempts int, maxAttempts int, choice String, success String,
    source String, probSubTime timestamp, done String,
    createDateTime timestamp, lastModDateTime timestamp,
    courseRun String) STORED AS ORC;

CREATE TABLE EventVideoInteract (eventId bigint, sessionSysName String,
    lmsName String, orgName String, courseName String, courseRun String,
    lmsUserId bigint, eventName String, eventNo int, videoSysName String,
    videoTitle String, chapterSysName String, chapterTitle String,
    oldSeekTime float, currSeekTime float, videoNavigType String,
    oldSpeed float, currSpeed float, source String,
    createDateTime timestamp, lastModDateTime timestamp) STORED AS ORC;

CREATE TABLE EventWikiInteract (eventId bigint, lmsName String,
    orgName String, courseName String, eventName String) STORED AS ORC;

CREATE TABLE LMSList (LMSShortName String, LMSFullName String,
    DateTimeFormat String) STORED AS ORC;

CREATE TABLE States (id int, name String) STORED AS ORC;

CREATE TABLE StudentCourseAccessRole (id int, lmsUserId int,
    orgName String, courseName String, courseRun String, role String)
    STORED AS ORC;

CREATE TABLE StudentCourseEnrolment (enrolId bigint, lmsName String,
    orgName String, courseName String, courseRun String, lmsUserId bigint,
    enrolmentDate timestamp, active String, mode String) STORED AS ORC;

CREATE TABLE StudentCourseGrades (id bigint, lmsName String,
    orgName String, courseName String, courseRun String, lmsUserId bigint,
    lmsUserName String, moduleType String, moduleSysName String,
    score int, maxScore int, noOfAttempts int, hintUsed String,
    hintAvailable String, state String, goals String,
    createDateTime timestamp, lastModDateTime timestamp,
    totSessDuraInSecs int, done String) STORED AS ORC;

CREATE TABLE URLTree (urlId bigint, lmsName String, orgName String,
    courseName String, courseRun String, urlSysName String,
    urlType String, parentUrl String) STORED AS ORC;

CREATE TABLE User (userId bigint, lmsUserId bigint, lmsName String,
    orgName String, name String, gender String, registrationDate date,
    emailId String, mothertounge String, highestEduDegree String,
    goals String, city String, state String, active int,
    firstAccesDate timestamp, lastAccessDate timestamp, allowCert int,
    yearOfBirth int, pincode int, aadharId String) STORED AS ORC;


CREATE TABLE UserSessionOld (sessionId bigint, sessionSysName String,
    lmsName String, orgName String, courseName String, courseRun String,
    lmsUserId bigint, userName String, agent String, hostName String,
    ipAddress String, createDateTime timestamp, eventType String,
    eventSource String, eventName String, dataSource String,
    oldVideoSpeed float, currVideoSpeed float, oldVideoTime float,
    currVideoTime float, videoNavigType String, oldGrade float,
    currGrade float, maxGrade float, attempts int, maxNoAttempts int,
    hintAvailable String, hintUsed String, currPosition int,
    oldPosition int, chapterSysName String, chapterTitle String,
    sessSysName String, sessTitle String, moduleSysName String,
    moduleTitle String, answerChoice String, success String, done String,
    enrolmentMode String, totDurationInSecs int, eventNo int,
    otherTitle String, otherType String, anonymous String,
    anonymousToPeers String, eduLevel String, gender String,
    commentableId String, commentType String, commentSysId String,
    aadhar String, problemSubmissionTime timestamp, hintMode String,
    currentSeekTime float, queryText String, noOfResults int,
    lastModDateTime timestamp) STORED AS ORC;

CREATE TABLE UserSessionOldLog (sessionId bigint, sessionSysName String,
    lmsName String, orgName String, courseName String, courseRun String,
    lmsUserId bigint, userName String, agent String, hostName String,
    ipAddress String, createDateTime timestamp, eventType String,
    eventSource String, eventName String, dataSource String,
    oldVideoSpeed float, currVideoSpeed float, oldVideoTime float,
    currVideoTime float, videoNavigType String, oldGrade float,
    currGrade float, maxGrade float, attempts int, maxNoAttempts int,
    hintAvailable String, hintUsed String, currPosition int,
    oldPosition int, chapterSysName String, chapterTitle String,
    sessSysName String, sessTitle String, moduleSysName String,
    moduleTitle String, answerChoice String, success String, done String,
    enrolmentMode String, totDurationInSecs int, eventNo int,
    otherTitle String, otherType String, anonymous String,
    anonymousToPeers String, eduLevel String, gender String,
    commentableId String, commentType String, commentSysId String,
    aadhar String, problemSubmissionTime timestamp, hintMode String,
    currentSeekTime float, queryText String, noOfResults int,
    lastModDateTime timestamp) STORED AS ORC;

CREATE TABLE eventColor (eventTYpeId int, eventType String,
    eventColor String, colorCode String) STORED AS ORC;

CREATE TABLE eventNoTYpe (id int, eventNo int, eventTypeId int)
    STORED AS ORC;

CREATE TABLE myUserSession (sessionId bigint, sessionSysName String,
    lmsName String, orgName String, courseName String, courseRun String,
    lmsUserId bigint, userName String, agent String, hostName String,
    ipAddress String, createDateTime timestamp, eventType String,
    eventSource String, eventName String, dataSource String,
    oldVideoSpeed float, currVideoSpeed float, oldVideoTime float,
    currVideoTime float, videoNavigType String, oldGrade float,
    currGrade float, maxGrade float, attempts int, maxNoAttempts int,
    hintAvailable String, hintUsed String, currPosition int,
    oldPosition int, chapterSysName String, chapterTitle String,
    sessSysName String, sessTitle String, moduleSysName String,
    moduleTitle String, answerChoice String, success String, done String,
    enrolmentMode String, totDurationInSecs int, eventNo int)
    STORED AS ORC;


CREATE TABLE tmpCourseTable (lmsUserId bigint, courseName String,
    createDateTime timestamp, modDateTime timestamp,
    moduleSysName String) STORED AS ORC;

CREATE TABLE tmpEventDescrip (eventType String, eventString String,
    logdirName String, logfileName String) STORED AS ORC;

CREATE TABLE tmpProbInteract (eventId bigint, usrId bigint,
    problemId bigint, newScore int, maxGrade int, attempts int,
    success String, done String) STORED AS ORC;

CREATE TABLE tmpProblem (problemId bigint, quizTitle String,
    noOfAttemptsAllowed int, quizMaxMarks float) STORED AS ORC;

CREATE TABLE tmpTime (idTmpTime int, startDateTime timestamp,
    endDateTime timestamp, totTimeSpent int, strTimeSpent String)
    STORED AS ORC;


Appendix C

Script To Draw Replayed Video Part Graphs

Listing C.1: Script to draw Graphs as in Section 5.7

#!/bin/bash

# Arguments: output xlabel ylabel title filename

gnuscript(){
echo -n "reset
set terminal postscript eps font ',10'
set output '$1'
set key font ',10'
set size square
set encoding utf8
#set style fill transparent solid 0.5 noborder
set style function filledcurves
set clip two
#set key outside
set key top left
set xlabel \"$2\"
set ylabel \"$3\"
set title \"$4\"
set yrange [0:]
set xrange [0:]
unset colorbox
plot '"
echo -n $5
echo "' using 1:2 with filledcurves above x1=0 lc rgb \"gold\" notitle"
}

rm -f CS101/graphs/*
rm -f CS101/Data/*

python VideoCountReplay.py

for file in CS101/Data/*.txt
do
    fullname=${file%.*}
    name=${fullname##*\/}
    gnuscript "CS101/graphs/$name.eps" "Time from start of video in Second" "Number of Times Video part Replayed" "Video Replay for $name" "$file" > temp.p
    gnuplot temp.p
done

Listing C.2: Script to find ranges for replayed videos in CS101 (Section 5.7), VideoCountReplay.py

#!/usr/bin/python

# Per-second replay counts for each video, one list per module id.
video = {}

# Vid.txt: moduleSysName, low, high. Allocate one counter per second up to high.
with open('/home/hduser/resultRnD/CS101/Vid.txt') as fp:
    for line in fp:
        tok = line.split()
        video[tok[0]] = [0 for x in range(int(tok[2]) + 1)]

# VidRanges.txt: moduleSysName, curr, old. Each backward seek replays [curr, old).
with open('/home/hduser/resultRnD/CS101/VidRanges.txt') as fp:
    for line in fp:
        tok = line.split()
        for i in range(int(tok[1]), int(tok[2])):
            video[tok[0]][i] += 1

# Write one "second count" data file per video for gnuplot.
for key, abc in video.items():
    f1 = open("/home/hduser/resultRnD/CS101/Data/CS101x15_" + str(key) + ".txt", 'w')
    for i in range(len(abc)):
        f1.write(str(i) + " " + str(abc[i]) + "\n")
    f1.close()
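The inner loop of Listing C.2 touches every second of every seek range. As an alternative, the same per-second counts can be computed with a difference array, which records only the two endpoints of each range and then takes a running sum. This is a sketch of a swapped-in technique, not the method used in the script above; the function name `coverage_counts` is hypothetical, and ranges are clipped to the list length as an added safeguard.

```python
def coverage_counts(length, ranges):
    """Per-second counts for half-open ranges [start, end) over 0..length-1."""
    diff = [0] * (length + 1)
    for start, end in ranges:
        # Clip to the valid interval so out-of-range seeks cannot raise IndexError.
        start = max(0, min(start, length))
        end = max(0, min(end, length))
        if start < end:
            diff[start] += 1   # coverage begins at start
            diff[end] -= 1     # and stops just before end
    counts = []
    running = 0
    for delta in diff[:length]:
        running += delta
        counts.append(running)
    return counts

# Hypothetical sample: two overlapping ranges and one empty range on a 6-second video.
print(coverage_counts(6, [(1, 4), (2, 6), (3, 3)]))
```

For a video with `high` as its largest seek position, calling it with `length = high + 1` matches the list size allocated in Listing C.2.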


Appendix D

Script to Draw Forwarded Video Part Graph

Listing D.1: Script to draw Graphs as in Section 5.8

#!/bin/bash

# Arguments: output xlabel ylabel title filename

gnuscript(){
echo -n "reset
set terminal postscript eps font ',10'
set output '$1'
set key font ',10'
set size square
set encoding utf8
#set style fill transparent solid 0.5 noborder
set style function filledcurves
set clip two
#set key outside
set key top left
set xlabel \"$2\"
set ylabel \"$3\"
set title \"$4\"
set yrange [0:]
set xrange [0:]
unset colorbox
plot '"
echo -n $5
echo "' using 1:2 with filledcurves above x1=0 lc rgb \"gold\" notitle"
}

rm -f CS101_forward/graphs/*
rm -f CS101_forward/Data/*

python VideoCountSkip.py

for file in CS101_forward/Data/*.txt
do
    fullname=${file%.*}
    name=${fullname##*\/}
    gnuscript "CS101_forward/graphs/$name.eps" "Time from start of video in Sec" "Number of Times Video part Forwarded" "Video Forward for $name" "$file" > temp.p
    gnuplot temp.p
done

Listing D.2: Script to find ranges for forwarded videos in CS101 (Section 5.8), VideoCountSkip.py

#!/usr/bin/python

# Per-second skip counts for each video, one list per module id.
video = {}

# Vid.txt: moduleSysName, low, high. Allocate one counter per second up to high.
with open('/home/hduser/resultRnD/CS101_forward/Vid.txt') as fp:
    for line in fp:
        tok = line.split()
        video[tok[0]] = [0 for x in range(int(tok[2]) + 1)]

# VidRanges.txt: moduleSysName, old, curr. Each forward seek skips [old, curr).
with open('/home/hduser/resultRnD/CS101_forward/VidRanges.txt') as fp:
    for line in fp:
        tok = line.split()
        for i in range(int(tok[1]), int(tok[2])):
            video[tok[0]][i] += 1

# Write one "second count" data file per video for gnuplot.
for key, abc in video.items():
    f1 = open("/home/hduser/resultRnD/CS101_forward/Data/CS101x15_" + str(key) + ".txt", 'w')
    for i in range(len(abc)):
        f1.write(str(i) + " " + str(abc[i]) + "\n")
    f1.close()


References

[1] Apache Derby documentation. Retrieved Feb 2, 2016 from https://db.apache.org/derby/.

[2] Apache Hadoop documentation. Retrieved Feb 2, 2016 from http://hadoop.apache.org/.

[3] Apache Tez documentation. Retrieved April 30, 2016 from https://tez.apache.org/.

[4] Sandeep Kale. Iitbxdataanalysis: Log cleaning and loading code. Retrieved May 1, 2016 from http://www.it.iitb.ac.in/frg/wiki/index.php/G8_-_Large_System_Integration#Sandeep_Nathu_Kale.

[5] Tutorials Point. Apache Hive tutorial. Retrieved Feb 2, 2016 from http://www.tutorialspoint.com/hive/.

[6] Apache Hive Team. Apache Hive documentation. Retrieved Feb 2, 2016 from https://hive.apache.org/.

[7] EdX Team. edX research guide, 2014. Retrieved Feb 2, 2016 from http://edx.readthedocs.io/projects/devdata/en/latest/index.html.
