Raisul Islam Rashu*, Syed Tanveer Jishan, Naheena Haq and …€¦ · Syed Tanveer Jishan is an...

Int. J. Knowledge Engineering and Soft Data Paradigms, Vol. 5, No. 1, 2015 1

Copyright © 2015 Inderscience Enterprises Ltd.

Implementation of optimum binning, ensemble learning and re-sampling techniques to predict student’s performance

Raisul Islam Rashu*, Syed Tanveer Jishan, Naheena Haq and Rashedur M. Rahman Department of Electrical and Computer Engineering, North South University, Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh Email: [email protected] Email: [email protected] Email: [email protected] Email: [email protected] *Corresponding author

Abstract: Educational data-mining is an emerging area of research that could extract useful information for the students as well as for the instructors. In this research, we explore data mining techniques that predict students’ final grade. We validate our method by conducting experiments on data that are related to grade for courses in North South University, the first private university and one of the leading universities in higher education in Bangladesh. We also extend our ideas through discretisation of the continuous attributes by equal width binning and incorporate it on traditional mining algorithms. However, due to imbalanced nature of data, we got lower accuracy for imbalanced classes. We implement two re-sampling techniques, i.e., ROS (random over sampling), RUS (random under sampling). Experimental results show that re-sampling techniques could overcome the problem of imbalanced dataset in classification significantly and improve the performance of the classification models. Moreover, three ensemble techniques, namely, bagging, boosting (AdaBoost) and random forests have been applied in this research to predict the students’ academic performance.

Keywords: educational data mining; EDM; classification; Naive Bayes; decision tree; neural network; discretisation; equal width binning; ensemble learning.

Reference to this paper should be made as follows: Rashu, R.I., Jishan, S.T., Haq, N. and Rahman, R.M. (2015) ‘Implementation of optimum binning, ensemble learning and re-sampling techniques to predict student’s performance’, Int. J. Knowledge Engineering and Soft Data Paradigms, Vol. 5, No. 1, pp.1–30.

Biographical notes: Raisul Islam Rashu is working as an Instructor in the Department of Electrical and Computer Engineering in North South University, Dhaka, Bangladesh. He received his BSc in Computer Science and Engineering from North South University, Bangladesh in the year of 2014. He also worked as an Undergraduate Teaching Assistant (UGA) in the Department of Electrical and Computer Engineering in North South University. He has published several conference and journal papers in the area of data mining on educational data.

2 R.I. Rashu et al.

He has more than one year of experience in research. His expertise include: educational data mining, knowledge discovery and expert system design, knowledge and data engineering.

Syed Tanveer Jishan is an Instructor in the Department of Electrical and Computer Engineering at the North South University, Bangladesh. He received his BSc in Computer Science from the North South University, Bangladesh in 2014. He has approximately one and half years of experience in research. His expertise include: educational data mining, knowledge discovery and expert system design, privacy preserving data publishing and mining.

Naheena Haq received her BSc in Computer Science and Engineering from North South University, Bangladesh in the year of 2014. She has one year of experience in research. She has couple of publications in the area of data mining.

Rashedur M. Rahman is working as an Associate Professor in Electrical and Computer Engineering Department in North South University, Dhaka, Bangladesh. He received his PhD in Computer Science from the University of Calgary, Canada and Masters degree from the University of Manitoba, Canada in 2007 and 2003 respectively. He has published more than 90 research articles in peer reviewed journals and conference proceedings, mainly in the area of parallel, distributed, grid and cloud computing, knowledge and data engineering. He serves as an organising committee member of different international conferences organised by IEEE and ACM in home and abroad.

1 Introduction

Data mining technique can be applied in various fields like business, marketing, management, medicine, engineering, etc. Educational data mining is a new emerging technique of data mining in the field of education. Educational data can be from different sources but generally from academic institutions. However, online learning systems are new environment for generating educational data which can be used to analyse and extract useful information (Romero and Ventura, 2010).

The goal of this paper is to predict the students’ performance using the CGPA, quiz, laboratory, midterm and attendance marks so that the students can be alerted before the final examination regarding their grade outcome. This will not only help the students but also the instructors who will get an insight on how the students are doing in the course.

For this work we acquired dataset of a particular course from North South University. After acquiring the data we preprocessed it and then applied several classification algorithms such as Naïve Bayes, decision tree and neural network. We also discretised the continuous attributes using optimal equal width binning as proposed by Kaya (2008) and record the accuracy with the model having continuous attributes’ class estimated using probability density function. We have manipulated dataset not only with optimal equal binning but also with the combination of ensemble techniques to build the models using three classification techniques. However, in some classes we do not get good accuracy due to the fact that data is imbalanced, i.e., some grades are obtained only by very few numbers of students. Therefore, we look into the techniques at the data pre processing level, i.e., re-sampling methods. We employed two re-sampling techniques

Implementation of optimum binning, ensemble learning and re-sampling 3

including random over sampling (ROS), and random under sampling (RUS). With the use of the rebalanced data we trained three different classifiers, i.e., decision tree, Naïve Bayes and neural network and use them to classify the test set. After all the models are built we compared their accuracy, precision, recall and f-measure of the class labels for those models.

The rest of the paper is organised as follows. Section 2 describes the research works that are related to data mining, educational data mining, techniques that are generally used to deal class imbalance problems. Section 3 discusses data mining process that includes the data pre processing, attribute transformation, the parameters for classification methods, bin partitioning and re-sampling techniques as well as the ensemble methods used in this research in detail. Section 4 presents the experimental results and makes a comparative study of classification methods. Finally, Section 5 concludes and gives directions of future research.

2 Related work

The understanding of learning process can be enhanced by applying data mining in educational sector. The class imbalance problem is prevalent in many applications, including: fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring, etc (Nitesh et al., 2002). Bharadwaj and Pal (2012) used students’ family status, living location, medium of teaching, mother’s qualification, family annual income, and students’ grade in senior secondary exam to predict how the students perform academically.

In another study Bharadwaj and Pal (2011) used students’ class test grade, assignment performance, general proficiency, attendance in class and lab work, previous semester marks, seminar performance to predict end of the semester marks. Ayers et al. (2009) used several clustering algorithms such as k-means clustering, hierarchical agglomerative clustering and model based clustering to understand skill levels of the students and group them based on their skill sets.

Rus et al. (2009) used classification algorithms like Naïve Bayes, Bayes net, support vector machines, logistic regression and decision trees to build a system which can detect student’s mental model. Pal and Pal (2013) used some features such as high school grade and senior secondary grade as well as other attributes related to student’s family to identify the students who need special advising or counseling from the teachers. They conducted the study at VBS Purvanchal University, Jaunpur, India and used classification algorithms.

Yadav et al. (2012) used the decision tree algorithms such as ID3, CART and C4.5 and made a comparative analysis to predict students’ performance at the end of the semester. They used students’ attendance, class test grade, seminar, assignment marks, and lab works as features. In their study they achieved 52.08%, 56.25% and 45.83% accuracy for each of these classification techniques respectively. Nitesh et al. (2002) found that majority class and minority class both have to equally represent in classification category for balanced dataset. They used combination of the method of over sampling the minority class and under sampling the majority class to accomplish the better classifier performance in ROC space. They mainly introduced the synthetic

4 R.I. Rashu et al.

minority over-sampling approach which provides the new technique in over sampling and intercourse with the under sampling makes the better result.

Chen (2008) used several re-sampling techniques for finding the maximum accuracy of classification from fully labelled imbalanced training dataset. SMOTE, oversampling by duplicating minority examples, RUS, is mainly used to create new training dataset. Standard classifiers like decision tree, Naive Bayes, neural network are trained in this dataset and all the techniques show improved accuracy except Naive Bayes.

The objective of the study conducted by Pandey and Taruna (2014) is to assess the accuracy of the ensemble techniques for predicting the student’s academic performance. In this study, five ensemble techniques based on four learning algorithms, namely, AdaBoost, bagging, random forest and rotation forest have been compared for the same number (ten) of base classifiers. They have found that the performance of random forest algorithm is lowest whereas the performance of rotation forest algorithm is highest amongst all algorithms in terms of model accuracy for predicting the student’s performance at the primary stages of four year engineering graduate program. Moreover, they have observed that the performance of AdaBoost and bagging ensembles are better than random forest and close to rotation forest algorithm.

Another study for MCA students have been conducted by Sharaf et al. (2013) to predict the result in final examination based on their internal marks. In their study, they have proposed two decision tree algorithms, namely ID3 and C4.5 for the prediction. The predicted result has been compared with the original result to demonstrate the accuracy of the suggested model.

Although there are many handful works on grade prediction models, our focus is to handle the continuous attributes effectively. Handling the continuous attributes effectively will better predict the grade as most of the attributes in course mark sheets or datasets are continuous in nature. Besides, we want to overcome the imbalance dataset problem. Overcoming the imbalance dataset problem effectively will result in better performance of the grade prediction models.

3 Data mining process

We have collected dataset of the course numerical analysis from North South University. The dataset contains record of 5 semesters, having 181 instances. Each instance is a student record.

3.1 Data selection

Originally the dataset had student ID, student name, five quiz marks, midterm marks, attendance, laboratory marks, final marks and final grade as attributes. We have selected the attribute which contains the percentage of marks obtained by the students in quizzes rather than taking all the quizzes into account. We have discarded the final marks as our goal is to predict the final grade before the final examination being taken. Final grade is considered as the class label.


3.2 Data preparation

We have discarded the student ID and name attributes as these are not directly required for the data analysis purpose. Students’ CGPA, which was not initially a part of the dataset, was retrieved and added as an attribute. Quiz, midterm and laboratory marks were normalised as there was fluctuation of weight assigned to each of these attributes in different semesters.

All the attributes are described in details below.

• CGPA – Cumulative grade point average. CGPA of students who enrolled in the course are taken into consideration.

• Attendance marks – Normalisation will have no effect on the attendance marks provided it had binary marking for every instances, that is either 0 or full marks. However normalisation is done in order to keep consistency in terms of look with the other attributes.

• Quiz marks – Best four out of the five quizzes are counted as per the course policy. The average of the best four quizzes are taken and then normalised between 0 to 100.

• Midterm marks – Midterm examination marks are also normalised between 0 to 100. Generally, only one midterm was taken every semester except in one of them when two midterms were taken.

• Laboratory marks – Laboratory marks are also normalised between 0 to 100. However, there was not much fluctuation in the weight of laboratory marks in different semesters.

• Final grade – Classification techniques are used to predict the final grade. Class labels of final grades are A, B, C, D and F.

3.3 Probability estimation on continuous data

In case of continuous data it is required to estimate the class. One of the methods is to assume a certain form of probability distribution for the continuous data. The most common form of probability distribution that is chosen is Gaussian distribution which is stated in equation (1). In this equation, Ai is the ith instance of the attribute A and cj is the jth class label. The symbol, µ, stands for the population mean and the symbol, σ2, stands for variance of the given population.

( )( )2

22

2

12

i ij

ij

A μσ

i jij

P A c eπσ

−

= (1)

We used probability distribution function on the continuous attributes in the dataset in one of the three classification models presented in the paper.

6 R.I. Rashu et al.

3.4 Basic classification techniques

3.4.1 Naive Bayes

Naïve Bayes is a method of classification in which Bayes theorem is applied. It is a simple probabilistic classifier. It can be done by evaluating an approximation algorithm rather than by iterative approximation which is expensive. Naïve Bayes classifier can be trained efficiently in supervised learning. This classifier considers each feature to contribute independently to the probability and needs only a small amount of training data for classification.

We use Naïve Bayes classification to create two different models. To estimate the class labels for continuous attributes we used probability distribution function (PDF) and then applied Naïve Bayes classification. Another model was build using a discretisation method (Kaya, 2008) to estimate the class labels for continuous data and after the discretisation of the continuous attributes we used Naïve Bayes for classification.

3.4.2 C4.5 algorithm for classification

C4.5 is an extension of iterative dichotomiser 3 (ID3). It is an algorithm which generates decision tree from a dataset. In this technique we need to calculate entropy of every attribute of dataset and then we have to split the dataset into subsets using the attributes with minimum entropy or maximum information gain. One of the major extensions of C4.5 from ID3 is that it accepts both continuous and discrete features, handles incomplete data points and different weights can be applied on the features that comprise of the training data (Quinlan, 1993). We split the data using gain ratio and minimal size for split was set to 4. That means those nodes with number of records more than or equal to 4 will be split.

3.4.3 Back propagation algorithm for classifcation

Back propagation stands for backward propagation of errors. It is a method of artificial neural network. It is used along with an optimisation method called gradient descent. Back propagation needs a known output for each input value to calculate the loss function. It is considered as a supervised learning method. It is also used in some unsupervised networks like auto encoders. Activation function used by artificial neurons is differentiable which is required in back propagation. The back propagation algorithm is divided into two phases: propagation and weight update (Wikipedia, Backpropagation).

In our model, which is shown in Figure 1, we used three layers of neurons: input layer, hidden layer and output layer. In input layer, the numbers of neurons are six. They are basically attributes of the dataset, such as quiz, midterm, laboratory, attendance, CGPA and one extra bias. In output layers, the number of neurons is five and it represents the class label of the course grade. In hidden layer, the number of neurons is six with one extra bias which makes the total number of neurons to seven. The number of neurons for hidden layer is calculated using the equation. The training of the neural network was done for 350 cycles with a learning rate of 0.1 and momentum 0.17.

. . . 12

No of Attributes No of ClassesNo of Neurons += +


Figure 1 Design of the neural network (see online version for colours)

3.5 Extended data mining methods

3.5.1 Discretisation of continuous data using data binning

Continuous features can be transformed into nominal features through a data pre-processing method called data binning. In this process, the continuous data is broken down into smaller number of bins. Equal width binning is one of the simplest approaches of discretising the data. The idea is to divide the range of the continuous data into k number of bins, where each bin contains the same width.

In this paper, we have implemented a discretisation technique proposed by Kaya (2008) which is based on equal width binning and error minimisation. For a continuous attribute we will dynamically search for the bin width value until we find the optimal one. Secondly, datasets can have more than one continuous attribute. Therefore, finding optimal bin width value for all the continuous attributes in the dataset will result in better overall performance.

Figure 2 Accuracy fluctuation due to bin width values (see online version for colours)

8 R.I. Rashu et al.

In Figure 2, a bar graph is shown which represents how bin width value affects the accuracy of the classification result. On the x-axis, bin width values from 2 to 8 are provided. On the y-axis, accuracy of the classification of the dataset is provided. We can observe that when the bin width value is set to 4 we get the highest accuracy. Thus the optimal bin with value for the continuous attribute CGPA is 4.

3.5.2 Random under sampling

RUS is a non-heuristic approach where the number of samples of the majority class is reduced to generate a balanced dataset. According to under sampling method, class distribution is maintained in such a way that samples of the minority class are preserved while on the other hand, some samples of the majority class are chosen as a subset of samples from the majority class randomly. After selecting the subset of samples of majority class, samples of minority class are combined with the majority class and make a new training set. Thus, the size of majority class is decreased to match with the size of minority class (Bekkar and Alitouche, 2013). RUS is explained in Figure 3.

Figure 3 Technique of RUS (see online version for colours)


3.5.3 Random over sampling

Random over-sampling is the simplest re-sampling approach to oversampling. It is a non-heuristic method which forms a balanced dataset by increasing the number of samples of the minority class. Here instances from the minority class are selected randomly. According to this method, class distribution is made in such a way that minority class holds the duplicate copies of examples from the minority class so that the minority class is in balance with the majority class. After duplicating the samples, they are then added to the new training set. All the samples from the majority and the minority class are preserved. As a result, important features of the samples from the original dataset are available. RUS is explained in Figure 4.

Figure 4 Technique of ROS (see online version for colours)

10 R.I. Rashu et al.

3.5.4 Ensemble method (bagging, boosting and random forest)

The ensemble classification is based on the viewpoint that a single expert is less likely to provide more precise decisions in comparison to a group of experts. Ensemble modelling congregates the set of classifiers to build a single composite model which provides better precision. Researches show that composite model estimation provides better results than single model estimation. Actually it is the last decade when the study in the field of ensemble methods became widespread. The machine learning researchers, after carrying out a number of experimental studies, have proved that combining the outputs of multiple classifiers lessens the generalisation error (Pandey and Taruna, 2014). More details of the ensemble method could be found in Tan et al. (2006); a short description is given in Algorithm 1.

Algorithm 1 General procedure for ensemble method

1 Let D denote the original training data, k denote the number of base classifiers, and T be the test data.

2 for i = 1 to k do

3 Create training set, Di from D.

4 Build a base classifier Ci from Di.

5 end for

6 for each test record x ε T do

7 C˙(x) = Vote(C1(x),C2(x),…,Ck(x))

8 end for

Bagging depends on bootstrap (Efron and Tibshirani, 1993) sampling method. In each of the iterations, a varying collection of bootstrap specimen is generated for building the distinct classifier of the same algorithm. The data item is chosen arbitrarily with replacement in bootstrap sampling technique; which implies that either some instance can be repeated or some of them can be left out from the actual dataset during the sampling steps. The following step of bagging is to combine all the classifiers created in the earlier stage. Bagging combines the outcome of the classifiers with the help of voting to create the final prediction. According to Breiman (1996) Bagging is an effective ensemble method for unstable learning algorithms where minor changes in the training dataset results in big changes in predictions, such as decision tree, neural network, etc. (Pandey and Taruna, 2014).

More details of the bagging algorithm could be found in Tan et al. (2006), a short description is given in Algorithm 2.

Algorithm 2 Bagging algorithm

1 Let k be the number of bootstrap samples 2 for i = 1 to k do 3 Create a bootstrap sample of size N, Di. 4 Train a base classifier Ci on the bootstrap sample Di. 5 end for


6 ( )1

(x) = argmax ( ) ) .iy i

C δ C x y=

=∑˙

{δ(·) = 1 if its argument is true and 0 otherwise}.

Boosting boosts the performance of the weak classifier to a robust level. It produces sequential learning classifiers using re-sampling (re-weighting) of the data instances. Initially all the instances are assigned with equal uniform weights. During each learning phase, a new hypothesis is learned and the instances are re-weighted such that a correctly classified instance having lower weight and system can concentrate on instances having higher weights that have not been correctly classified during this phase. The inaccurately classified instances are selected so that they can be classified accurately during the subsequent learning step. This process continues till the construction of the last classifier. Finally, the results of all the classifiers are combined by the process of majority voting to find the final prediction. AdaBoost (Efron and Tibshirani, 1993) is a more general version of the boosting algorithm.

More details of the AdaBoost algorithm could be found in Tan et al. (2006); a short description is given in Algorithm 3.

Algorithm 3 Adaboost algorithm

1 w = {wj = 1/N | j = 1, 2,….,N}. {Initialise the weights for all N examples} 2 Let k be the number of boosting rounds. 3 for i = 1 to k do 4 Create training set Di by sampling (with replacement) from D according to w. 5 Train a base classifier Ci on Di. 6 Apply Ci to all examples in the original training set, D.

7 ( )( )1i j i j jε w δ C x y

N⎡ ⎤≠⎣ ⎦ {Calculate the weighted error.}

8 if ε i > 0.5 then 9 w = {wj = 1/N | j = 1, 2,….,N}. {Reset the weights for all N examples.} 10 Go back to Step 4. 11 end if

12 1 1ln2

ii

i

εε−

α =

13 Update the weight of each example according to equation 5.69. 14 end for

15 ( )1

(x) = argmax ( ) ) .N

i iy j

C δ C x y=

α =∑˙ .

Random forest is proposed by Breiman (2001) specifically for trees. It is the combination of bagging and random subspace technique for inducing the tree. It is similar to bagging except that each model is a random tree rather than a single model and each tree is grown according to the bootstrap sample of the training set to N. Another random step is used to split each node. A small subset of feature m is selected arbitrarily ((m < M) rather than


considering all possible splits M. The best split is chosen from this subset. Final classification is done using the process of majority voting across trees. More details of random forest could be found elsewhere (Tan et al., 2006).

3.6 Implementation of methods

In order to build the models and analyse them we used the data mining tool RapidMiner 5. First we import the pre-processed dataset .xlsx file and further process it in RapidMiner. While pre-processing in the RapidMiner we selected the grade attributes as the response variable and considered it as nominal type whereas the rest of the attributes are considered as numeric type. After that, we pass the pre-processed dataset into the validation system in which the classification model is built. In order to classify data we need to integrate a classification system in RapidMiner. For example in this figure the neural network classification system is used. After applying the model we need to measure the performance using the testing dataset. The process is depicted in Figure 5 which shows the blocks of the neural network classification model.

Figure 5 Building a classification model in RapidMiner 5 (see online version for colours)

We used the data mining tool Weka in order to implement RUS technique. Figure 6 depicts this. For implementing ROS technique, we used data mining tool RapidMiner 5. After applying the techniques we re-sample the data in order to overcome the problem with imbalanced data. Figure 7 depicts this.


Figure 6 Class distributions of data before re-sampling with RUS in Weka (see online version for colours)

Figure 7 Class distributions of data after re-sampling with RUS in Weka (see online version for colours)


Figure 8 ROS implementation in RapidMiner 5 (see online version for colours)

After re-sampling with the techniques we saved the dataset in .csv file and then covert the .csv file into excel sheet (.xlsx file). Then, we import the pre-processed dataset, i.e., the excel sheet (.xlsx file) to further pre-process in RapidMiner.

4 Results

We assigned 20% of the original dataset as the testing set to measure the accuracy of the models. Stratified sampling which is a method of sampling is used because it gives better coverage of the whole population. The final accuracy of the model is measured by taking the average of the five iterations. Table 1 Bin width values for the classification methods

Decision tree Naïve Bayes Neural network

Quiz 3 7 6 Midterm 7 6 8 Laboratory 5 6 4 Attendance 2 2 2 CGPA 4 4 4

4.1 Naive Bayes classification

4.1.1 Probability estimation on continuous data

For this model we estimated the class label of the continuous attributes using probability distribution function and then we build the model using Naive Bayes classification. The accuracy of the model is 61.11%. The worst class precision comes from the prediction of


class D as show in Table 2. The model inaccurately predicts instances of class C and class F belonging to class D of roughly about 58% of the time.

Table 2 Detailed analysis of the Naive Bayes model

True C True A True D True F True B Class precision Pred. C 33 0 8 2 16 55.93% Pred. A 0 22 0 0 15 59.46% Pred. D 6 0 9 6 0 42.86% Pred. F 0 0 1 1 0 50.00% Pred. B 9 6 0 1 45 73.77% Class recall 68.75% 78.57% 50.00% 10.00% 59.21% F-measure 61.68% 67.69% 46.15% 16.66% 65.69%

4.1.2 Optimal equal width binning on continuous data and Naive Bayes classification

Optimal equal width binning as proposed by Kaya (2008) is used to discretise every continuous attributes and then the model is build using the Naive Bayes classification. All the optimal bin value for the continuous attributes are listed in Table 1. The model has the accuracy of 68.33% which is 7.22% better than the previous Naive Bayes model. The prediction of the class D is as bad as the first model as show in Table 3. Moreover the precision of the predicted class F worsen for this model. However there is significant improvement for the prediction of class A and class C. In terms of class recall, the result is better than the previous Naive Bayes model. Table 3 Detailed analysis of the Naive Bayes model (optimal equal width binning)


4.1.3 Classification with re-sampled dataset with RUS

We re-sampled the actual dataset with RUS technique to balance the dataset and then we used Naive Bayes classification to build the model. The accuracy of the model is 52.00% which is a little over 9% worse compared to the accuracy of the Naïve Bayes using actual dataset. However this model is the worst model compared to other models of Naïve Bayes in terms of accuracy. The worst class precision comes from the prediction of class F as show in Table 4.


Table 4 Detailed analysis of the Naive Bayes model using re-sampled dataset with RUS

True A True B True F True C True D Class precision Pred. A 9 1 0 0 0 90.00% Pred. B 1 6 1 1 2 54.55% Pred. F 0 0 0 0 0 0.00% Pred. C 0 2 2 6 3 46.15% Pred. D 0 1 7 3 5 31.25% Class recall 90.00% 60.00% 0.00% 60.00% 50.00% F-measure 90.00% 57.15% 0.00% 52.17% 38.46%

4.1.4 Classification with re-sampled dataset with ROS

In this case, we re-sampled the actual dataset with ROS technique to balance the dataset and then we used Naive Bayes classification to build the model. The accuracy of the model is: 64.44% which is a little over 3% better than the accuracy of the Naïve Bayes using actual dataset. The worst class precision comes from the prediction of class D as show in Table 5. Table 5 Detailed analysis of the Naive Bayes model using re-sampled dataset with ROS

True C True A True F True B True D Class precision Pred. C 66 0 4 25 16 59.46% Pred. A 0 41 0 28 0 59.42% Pred. F 0 0 3 0 0 100.00% Pred. B 20 11 2 99 0 75.00% Pred. D 9 0 12 1 23 51.11% Class recall 69.47% 78.85% 14.29% 64.71% 58.97% F-measure 64.08% 67.77% 25.01% 69.48% 54.76%

4.1.5 Classification with bagging

In this case, we have used the dataset with bagging to build the model. The accuracy of the model is 61.11%. The worst class precision comes from the prediction of class D as shown in Table 6. Table 6 Detailed analysis of the Naive Bayes model with bagging



4.1.6 Classification with combination of bagging and binning

In this case, we have used the dataset with combination of bagging and binning to build the model. The accuracy of the model is 67.78%, which is 6.67% better than the previous Naive Bayes model. The worst class precision comes from the prediction of class D as shown in Table 7. Table 7 Detailed analysis of the Naive Bayes model with combination of bagging and binning


4.1.7 Classification with boosting

In this case, we have used the dataset with boosting to build the model. The accuracy of the model is 61.11%. The worst class precision comes from the prediction of class D as shown in Table 8. Table 8 Detailed analysis of the Naive Bayes model with boosting


4.1.8 Classification with combination of boosting and binning

In this case, we have used the dataset with combination of boosting and binning to build the model. The accuracy of the model is 68.33%, which is 7.22% better than the previous Naive Bayes model. The worst class precision comes from the prediction of class D and F as shown in Table 9. Table 9 Detailed analysis of the Naive Bayes model with combination of boosting and binning

True C True A True D True F True B Class precision

Pred. C 34 0 6 0 13 64.15% Pred. A 0 22 0 0 7 75.86% Pred. D 6 0 9 5 1 42.86%


Table 9 Detailed analysis of the Naive Bayes model with combination of boosting and binning (continued)

True C True A True D True F True B Class precision Pred. F 1 0 3 3 0 42.86% Pred. B 7 6 0 2 55 78.57% Class recall 70.83% 78.57% 50.00% 30.00% 72.37% F-measure 67.32% 77.19% 46.16% 35.295% 75.34%

4.2 Decision tree classification

4.2.1 Classification with actual data

For building the model, we use decision tree classifier and the accuracy is 45.56%. From Table 10, we can also notice that the class precision of most of the class is similar to other models except for the class D. This gives us a notion that due to misclassification of class D the overall accuracy for the decision tree model is affected. Table 10 Detailed analysis of the decision tree model

True C True A True D True F True B Class precision Pred. C 29 0 14 7 11 47.54% Pred. A 0 25 0 0 20 55.56% Pewwwred. D 2 0 1 1 0 25.00% Pred. F 0 0 1 2 0 66.67% Pred. B 17 3 2 0 45 67.16% Class recall 60.42% 89.29% 5.56% 20.00% 59.21% F-measure 53.21% 68.49% 9.09% 30.76% 62.93%

4.2.2 Optimal equal width binning on continuous data

Optimal equal width binning is also used to estimate the class labels of the continuous attributes before the model is build using decision tree which is generate by C4.5 algorithm. The optimal bin values for most of the continuous attributes are somewhat smaller than the optimal bin values generated while building the model using Naive Bayes classification. The optimal bin values are listed in Table 11. Table 11 Detailed analysis of the decision tree model (optimal equal width binning)



For the model built using decision tree, the accuracy is 63.89% which is around 18% better than the accuracy of the decision tree having all the continuous attributes discretised into three equal bins. However this model is the worst model compared to other models in terms of accuracy where optimal equal width binning is used. From the Table 10 and Table 11 we can also notice that the class precision of most of the class is similar to other models except for the class D which is 27.78%. This gives us a notion that due to misprediction of class D the overall accuracy for the decision tree model is affected.

Figure 9 represents a part of the decision tree model build after the data has been discretised using optimal equal width binning. From the figure we can derive the rules required to determine the students’ grades. For Example, if a student having CGPA between 2.3 to 2.8 ends up getting over 66.7% in quiz and roughly 40% in midterm however gets over 75% in laboratory he or she is mostly likely to get C as overall grade. From the decision tree in Figure 9 we can understand that the attribute quiz had the highest information gain which is then followed by CGPA.

Figure 9 Decision tree (optimal equal width binning)


We re-sampled the actual dataset with RUS technique to balance the dataset and then we used decision tree to build the model. The accuracy is 52.00% which is a little over 7% better than the accuracy of the decision tree using actual dataset. The worst class precision comes from the prediction of class B and class D as show in Table 12. Table 12 Detailed analysis of the neural network model using re-sampled dataset with ROS

True A True B True F True C True D Class precision

Pred. A 9 1 0 0 0 90.00%

Pred. B 1 5 0 5 2 38.46%

Pred. F 0 0 3 0 3 50.00%

Pred. C 0 2 2 4 0 50.00%

Pred. D 0 2 5 1 5 38.46%

Class recall 90.00% 50.00% 30.00% 40.00% 50.00%

F-measure 90.00% 43.48% 37.50% 44.44% 43.48%



We re-sampled the actual dataset with ROS technique to balance the dataset and then we used decision tree to build the model. The accuracy is 45.28% which is a little over 0.28% worse than the accuracy of the decision tree using actual dataset. The worst class precision comes from the prediction of class C and class D as show in Table 13.

Table 13 Detailed analysis of the decision tree model using re-sampled dataset with ROS

True C True A True F True B True D Class precision

Pred. C 0 0 0 0 0 0.00%

Pred. A 0 12 0 2 0 85.71%

Pred. F 0 0 0 0 0 0.00%

Pred. B 95 40 21 151 39 43.64%

Pred. D 0 0 0 0 0 0.00%

Class recall 0.00% 23.08% 0.00% 98.69% 0.00%

F-measure 0.00% 36.37% 0.00% 60.52% 0.00%


In this case, we have used the dataset with bagging to build the model. The accuracy of the model is 56.11%. The worst class precision comes from the prediction of class F as shown in Table 14.

Table 14 Detailed analysis of the decision tree model with bagging


Pred. C 19 0 5 1 6 61.29%

Pred. A 0 11 0 0 1 91.67%

Pred. D 0 0 2 1 0 66.67%

Pred. F 1 0 1 0 0 0.00%

Pred. B 28 17 10 8 69 52.27%

Class recall 39.58% 39.29% 11.11% 0.00% 90.79%

F-measure 48.099% 55.00% 19.05% 0.00% 66.34%


In this case, we have used the dataset with combination of bagging and binning to build the model. The accuracy of the model is 62.22%, which is 16.67% better than the previous decision tree model. The worst class precision comes from the prediction of class D as shown in Table 15.


Table 15 Detailed analysis of the decision tree model with combination of bagging and binning



In this case, we have used the dataset with boosting to build the model. The accuracy of the model is 36.11%. The worst class precision comes from the prediction of class D and F as shown in Table 16. Table 16 Detailed analysis of the decision tree model with boosting



In this case, we have used the dataset with combination of boosting and binning to build the model. The accuracy of the model is 63.89%, which is 18.33% better than the previous decision tree model. The worst class precision comes from the prediction of class D as shown in Table 17. Table 17 Detailed analysis of the decision tree model with combination of boosting and binning



4.3 Neural network

4.3.1 General model

The model built using the neural network has an accuracy of 65.56%. The accuracy of the neural network classification is the third best compared to the other models. It is shown in Table 18. Table 18 Detailed analysis of the neural network model


4.3.2 Optimal equal width binning on continuous data and classification using neural network

The optimal bin width values for the neural network classification model is represented in Table 1. We can observe that for all the models where we used optimal equal width binning the best bin values for the attributes: attendance and CGPA are same, which is 2 and 4.

This model gives the highest accuracy than other model using optimal equal width binning which is 68.89%. However it is just 0.56% better than the Naive Bayes classification for which optimal equal width binning is used. A detailed representation of the performance is given in Table 19. Table 19 Detailed analysis of the neural network model (optimal equal width binning)



We re-sampled the actual dataset with RUS technique to balance the dataset and then we used neural network to build the model. This model gives the accuracy of 50.00% which is 13% worse compared to the neural network classification using actual dataset. Detailed representation of the sample is given in Table 20.


Table 20 Detailed analysis of the neural network model using re-sampled dataset with RUS

True A True B True F True C True D Class precision Pred. A 10 5 0 0 0 66.67% Pred. B 0 3 2 3 0 37.50% Pred. F 0 0 3 0 3 50.00% Pred. C 0 0 1 2 0 66.67% Pred. D 0 2 4 5 7 38.89% Class recall 100.00% 30.00% 30.00% 20.00% 70.00% F-measure 80.00% 33.33% 37.50% 30.77% 50.00%


For this model, we re-sampled the actual dataset with ROS technique to balance the dataset and then we used neural network to build the model. This model gives the accuracy of 73.06% which is also around 10% better than the neural network classification using actual dataset. Detailed representation of the sample is given in Table 21. Table 21 Detailed analysis of the neural network model using re-sampled dataset with ROS

True C True A True F True B True D Class precision Pred. C 73 0 5 21 9 67.59% Pred. A 0 43 2 14 0 72.88% Pred. F 0 0 2 0 0 100.00% Pred. B 16 9 0 115 0 82.14% Pred. D 6 0 12 3 30 58.82% Class recall 76.84% 82.69% 9.52% 75.16% 76.92% F-measure 71.92% 77.48% 17.38% 78.495% 66.66%


In this case, we have used the dataset with bagging to build the model. The accuracy of the model is 66.67%. The worst class precision comes from the prediction of class F as shown in Table 22. Table 22 Detailed analysis of the neural network model with bagging




In this case, we have used the dataset with combination of bagging and binning to build the model. The accuracy of the model is 70.00%, which is 6.11% better than the previous neural network model. The worst class precision comes from the prediction of class F as shown in Table 23. Table 23 Detailed analysis of the neural network model with combination of bagging and

binning


Pred. C 38 0 8 1 12 64.41% Pred. A 0 18 0 0 4 81.82% Pred. D 4 0 10 7 0 47.62% Pred. F 0 0 0 0 0 0.00% Pred. B 6 10 0 2 60 76.92% Class recall 79.17% 64.29% 55.56% 0.00% 78.95% F-measure 71.03% 72.00% 51.28% 0.00% 77.92%


In this case, we have used the dataset with boosting to build the model. The accuracy of the model is 66.11%. The worst class precision comes from the prediction of class F as shown in Table 24. Table 24 Detailed analysis of the neural network model with boosting


Pred. C 36 0 10 1 12 61.02% Pred. A 0 17 0 0 6 73.91% Pred. D 2 0 8 7 0 47.06% Pred. F 0 0 0 0 0 0.00% Pred. B 10 11 0 2 58 71.60% Class recall 75.00% 60.71% 44.44% 0.00% 76.32% F-measure 67.29% 66.66% 45.71% 0.00% 73.88%


In this case, we have used the dataset with combination of boosting and binning to build the model. The accuracy of the model is 67.22%, which is 3.33% better than the previous neural network model. The worst class precision comes from the prediction of class F as shown in Table 25.


Table 25 Detailed analysis of the neural network model with combination of boosting and binning


Pred. C 35 0 7 1 13 62.50%

Pred. A 0 18 0 0 5 78.26%

Pred. D 4 0 10 7 0 47.62%

Pred. F 0 0 1 0 0 0.00%

Pred. B 9 10 0 2 58 73.42%

Class recall 72.92% 64.29% 55.56% 0.00% 76.32%

F-measure 67.31% 70.59% 51.28% 0.00% 74.84%

4.4 Random forest

For random forest, the model is built using random forest classification technique.

4.4.1 Classification with actual data

In this case, we have used the dataset without optimal binning to build the model. The accuracy of the model is 43.33%. The worst class precision comes from the prediction of class D and F as shown in Table 26. Table 26 Detailed analysis of the random forest model without optimal equal width binning


Pred. C 2 0 1 0 1 50.00%

Pred. A 0 2 0 0 1 66.67%

Pred. D 0 0 0 0 0 0.00%

Pred. F 0 0 0 0 0 0.00%

Pred. B 46 26 17 10 74 42.77%

Class recall 4.17% 7.14% 0.00% 0.00% 97.37%

F-measure 7.698% 12.899% 0.00% 0.00% 59.43%

4.4.2 Classification with optimal equal width binning

Optimal equal width binning is used to discretise every continuous attributes and then the model is built using the random forest classification. The accuracy of the model is 56.67%, which is 13.34% better than the previous random forest model. Moreover, the precision of the predicted class C is the worst for this model as shown in Table 27.


Table 27 Detailed analysis of the random forest model with optimal equal width binning

True C True A True D True F True B Class precision Pred. C 22 0 14 3 12 43.14% Pred. A 0 14 0 0 3 82.35% Pred. D 0 0 2 2 0 50.00% Pred. F 0 0 0 3 0 100.00%

Table 27 Detailed analysis of the random forest model with optimal equal width binning (continued)

True C True A True D True F True B Class precision Pred. B 26 14 2 2 61 58.10% Class recall 45.83% 50.00% 11.11% 30.00% 80.26% F-measure 44.44% 62.22% 18.18% 46.15% 67.41%

4.5 Summary of analysis

In Figure 10, an overall summary of the models of Naïve Bayes is represented. In the figure multiple bar graphs are shown where on the x-axis each chunk highlights analysis of a particular model. The y-axis represents the percentage of the results obtained for accuracy, average precision, average recall and average F-measure for each model. The data in Table 28 is the percentage values for which the bar graphs are built. For Naïve Bayes classification the accuracy is increased around 7% in optimal binning technique compared to the basic Naïve Bayes classification where the class labels of continuous attributes were estimated using probability distribution function. Besides, the accuracy is increased about 3% after applying ROS. After observing the comparative analysis among the models, we can conclude that the combination of bagging and binning as well as the combination of boosting and binning improve accuracy for Naïve Bayes classification technique. But, the combination of bagging and binning is worse than optimal binning.

Figure 10 Analysis of the Naïve Bayes models (see online version for colours)

Naïve Bayes


Table 28 Percentage values of the Naïve Bayes model

Naïve Bayes Accuracy (%)

Avg. precision (%)

Avg. recall (%)

Avg. f-measure (%)

Actual data 61.11 56.4 53.31 60.31 Optimal binning 68.33 60.86 60.35 60.26 ROS 64.44 69 57.26 56.22 RUS 52 44.39 52 47.56 Bagging 61.11 56.4 53.31 51.58 Bagging + binning 67.78 60.41 60.09 59.86 Boosting 61.11 56.4 53.31 51.58 Boosting + binning 68.33 60.86 60.35 60.26

In Figure 11, an overall summary of the models of decision tree are represented. Table 29 is the percentage values for which the bar graphs are built. In decision tree classification, the accuracy is increased about 7% after applying RUS where more than 18% performance improvement has been made by using optimal binning technique.

Figure 11 Analysis of the decision tree models (see online version for colours)

Decision Tree

Table 29 Percentage values of the decision tree model

Decision tree Accuracy (%)

Avg. precision (%)

Avg. recall (%)

Avg. f-measure (%)

Actual data 45.56 45.86 24.44 19.77 Optimal binning 63.89 50.56 48.96 49.54 ROS 45.28 25.87 24.35 19.38 RUS 52 53.38 52 51.78 Bagging 56.11 54.38 36.15 37.7 Bagging + binning 62.22 59.9 52.17 54.06 Boosting 36.11 31.71 24.42 22.79 Boosting + binning 63.89 61.45 55.09 57.14


In Figure 12, an overall summary of the models of neural network are represented. Table 30 is the percentage values for which the bar graphs are built. In neural network classification, the accuracy is increased of about 9.17% after applying ROS. In case of neural network classification technique, the combination of bagging and binning as well as the combination of boosting and binning perform very well. But the combination of boosting and binning is worst compared to optimal binning. However, the combination of bagging and binning of neural network model shows greater accuracy compared to the other models except ROS which has highest accuracy in neural network classification. Moreover, it is observed that binning works well for random forest model. It is also represented in Table 31 and Figure 13.

Figure 12 Analysis of the neural network models (see online version for colours)

Neural Network

Table 30 Percentage values of the neural network model

Neural network Accuracy (%)

Avg. precision (%)

Avg. recall (%)

Avg. f-measure (%)

Actual data 63.89 50.01 50.23 49.86 Optimal binning 68.89 53.3 55.91 54.16 ROS 73.06 76.29 64.23 62.39 RUS 50 51.95 50 46.32 Bagging 66.67 51.85 50.9 50.62 Bagging + binning 70 54.15 55.59 54.45 Boosting 66.11 50.72 51.29 50.71 Boosting + binning 67.22 52.36 53.82 52.8

Table 31 Random forest comparison

Method Accuracy (%)

Avg. precision (%)

Avg. recall (%)

Avg. f-measure (%)

Random forest (actual data)

43.33 31.89 21.74 16.01

Random forest (optimal binning)

56.67 66.72 43.44 47.68


Figure 13 Analysis of random forest models (see online version for colours)

Random Forest

5 Conclusions and future work

Our objective is to build a model that will predict the grade of the students for a particular course. We have successfully built eight models for each classifier and made comparative analysis between them. Optimal equal width binning (Kaya, 2008) is used in basic classification techniques. We have tried to identify the specific classifier that works best out of the six methods namely bagging, boosting, with optimal equal width binning, without optimal equal width binning, combination of bagging and binning, combination of boosting and binning; when offered with continuous dataset. For Naïve Bayes Classification the accuracy is increased of about 7% both for optimal equal width binning as well as combination of boosting and binning compared to the basic model of Naïve Bayes classification where the class labels of continuous attributes were estimated using probability distribution function. In decision tree classification, optimal equal width binning improves the performance substantially. For Naïve Bayes classification the accuracy is increased about 3% after applying ROS. However for decision tree classification, the accuracy is increased about 7% after applying RUS. For neural network classification, the accuracy is increased of about 9.17% after applying ROS. For neural network technique, ROS performs best where for two other data mining techniques Optimal binning as well as combination of bagging and binning perform well. Future work includes applying other re-sampling technique like SMOTE and combination of re-sampling techniques with optimal equal width binning.


References Ayers, E., Nugent, R. and Dean, N. (2009) ‘A comparison of student skill knowledge estimates’,

International Conference on Educational Data Mining, Cordoba, Spain, pp.1–10. Bekkar, M. and Alitouche, T.A. (2013) ‘Imbalanced data learning approaches review’,

International Journal of Data Mining & Knowledge Management Process (IJDKP), July, Vol. 3, No. 4, pp.15–33, doi: 10.5121/ijdkp.2013.3402 15.

Bharadwaj, B.K. and Pal, S. (2011) ‘Mining educational data to analyze students’ performance’, International Journal of Advance Computer Science and Applications, Vol. 2, No. 6, pp.63–69.

Bharadwaj, B.K. and Pal, S. (2012) ‘Data mining: a prediction for performance improvement using classification’, International Journal of Computer Science and Information Security, Vol. 9, No. 4, pp.136–140.

Breiman, L. (1996) ‘Bagging predictors’, Machine Learning, Vol. 24, No. 2, pp.123–140. Breiman, L. (2001) ‘Random forests’, Machine Learning, Vol. 45, No. 1, pp.5–32. Chen, Y. (2008) Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets

[online] https://www.cs.iastate.edu/~yetianc/cs573/files/CS573_ProjectReport_ YetianChen.pdf (accessed 25 July 2014).

Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap, Chapman & Hall, New York.

Kaya, F. (2008) Discretizing Continuous Features for Naive Bayes and C4. 5 Classifiers, University of Maryland [online] https://www.cs.umd.edu/sites/default/files/scholarly_papers/ fatih-kaya_1.pdf (accessed 12 February 2014).

Nitesh, V. and Chawla et al. (2002) ‘SMOTE: synthetic minority over-sampling technique’, Journal of Artificial Intelligence Research, Vol. 16, No. 1, pp.321–357.

Pal, A.K. and Pal, S. (2013) ‘Analysis and mining of educational data for predicting the performance of students’, International Journal of Electronics Communication and Computer Engineering, Vol. 4, No. 5, pp.1560–1565.

Pandey, M. and Taruna, S. (2014) ‘A comparative study of ensemble methods for students’ performance modeling’, International Journal of Computer Applications, Vol. 103, No. 8, pp.26–32.

Quinlan, J.R. (1993) C4. 5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, USA.

Romero, C. and Ventura, S. (2010) ‘Educational data mining: a review of the state of the art’, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 40, No. 6, pp.601–618.

Rus, V., Lintean, M. and Azevedo, R. (2009) ‘Automatic detection of student mental models during prior knowledge activation in MetaTutor’, International Conference on Educational Data Mining, Cordoba, Spain, pp.161–170.

Sharaf, A., Malaka, E., Moustafa, A., Harb, H.M. and Emara, A.H. (2013) ‘Adaboost ensemble with simple genetic algorithm for student prediction model’, International Journal of Computer Science & Information Technology (IJCSIT), April, Vol. 5, No. 2, pp.73–85.

Tan, P., Kumar, V. and Steinbach, M. (2006) Introduction To Data Mining, Adison Wesley, USA. Wikipedia, Backpropagation [online] http//:www.en.wikipedia.org/wiki/Backpropagation

(accessed 25 July 2014). Yadav, S.K., Bharadwaj, B. and Pal, S. (2012) ‘Data mining applications: a comparative study for

predicting student’s performance’, arXiv preprint arXiv:1202.4815.

Raisul Islam Rashu*, Syed Tanveer Jishan, Naheena Haq and …€¦ · Syed Tanveer Jishan is an...

Documents

Transcript of Raisul Islam Rashu*, Syed Tanveer Jishan, Naheena Haq and …€¦ · Syed Tanveer Jishan is an...