Classification for Early Detection of High School Students Vulnerable to Poor

Academic Performance

Jevon M. Mckenzie1, Soumya Ray2*

1. National Tsing Hua University Department of Technology Management, 2. National Tsing Hua University Institute of

Service Science

Taiwan, R.O.C.

*: Corresponding Author. Email: [email protected]. Fax: 886-3-561 0141

ABSTRACT

This paper aims to analyze various student demographic information. The purpose of doing so is to

provide educational institutions with meaningful high-level information on students. This information

will allow for the proactive intervention of identifying and providing additional support to high school

students who are highly vulnerable to failing.

The models in this research identified two groups of factors that are significant determinants of a

student’s academic performance: (1) factors that relate to the parents (parents’ education, resources

available at home, paid extra classes, address, student willingness to pursue higher education), and (2)

factors that relate to the student as an individual (study time, free time, alcohol consumption). Four

datamining models were applied: logistic regression using PCA scores, stepwise logistic regression,

a decision tree, and a random forest. PCA allowed us to identify orthogonal

components that were used as the independent variables in the logistic regression model. This method

proved to produce the best results.

Key words: Student academic Performance, Datamining, Intervention

Introduction

Throughout the world, student achievement is often associated with the future economic power and

competitiveness of a country (Pekka and Nissinen, 2011). Although this philosophy has been

acknowledged and has stimulated numerous related studies in the past, it is inevitable that the

academic performance of each generation of students will be hindered by different demographic

factors. Understanding this, we must strive to create and maintain a learning environment that is

conducive to improving the academic performance of our students. In order to achieve this, there is

much need to understand and identify significant and consistent factors that relate to students’

academic performance.

Nowadays the new trend is to use automated tools to analyze raw student data and extract interesting

information. Considering the multiple sources of data (e.g. traditional databases, online web pages) and

diverse interest groups (e.g. students, teachers, administrators or alumni), the education sector offers a

fruitful ground for Business Intelligence applications (Minaei-Bidgoli, 2003). Since it can help to

achieve a better understanding of this phenomenon and ultimately improve it, modeling student

performance is an important tool for both educators and students. For instance, school professionals

could perform corrective measures for weak students (e.g. remedial classes).

This research is driven by datamining techniques that are oriented towards prediction. Belize has not

yet focused on Business Intelligence and Datamining approaches to evaluate its education system;

therefore, the availability of demographic information on its student population is almost non-existent.

In order for this research to be applicable to Belize, when choosing the dataset we considered some

general characteristics of the Belizean student population and tried to find a student dataset from a

country that has a student population with similar characteristics. Belize is a very young and

developing country; therefore, most of its current generation of students are First Generation students.

These students are those whose parents did not attend college, hence the majority of them lack the

resources, moral support, and motivation to strive for excellence and pursue a higher education

(Cushman, 2006). In these families, role assignments about work, family, religion and community are

passed down through the generations creating “intergenerational continuity.” When a family member

disrupts this system by choosing to attend college, he or she experiences a shift in identity, leading to a

sense of loss. Not prepared for this loss, many first-generation students may come to develop two

different identities—one for home and another for college (Bedford, 2006).

Considering the similarity between Belize and Portugal as it relates to the low percentage of adults with

postsecondary education, the dataset used in this research contains various demographic information on

students from two Portuguese high schools. The dataset comes from the UCI Machine

Learning Repository (Paulo and Silva, 2008). In alignment with the research on educational challenges

of “First Generation Students”, correlation of the variables within the dataset in this research identified

that students’ willingness to pursue higher education is the factor that has the highest correlation with

the students’ academic performance. Other data analysis such as heat maps and PCA identified other

significant factors that were consistent with those that define First Generation students. Ultimately,

although some First Generation students overcome the many adversities (identified by the significant

factors) that they face, most of them do not. As a consequence, it is imperative that we

identify and classify those students who are exposed to the various challenges described above.

The Proposed Approach and Methodology

Software to be used

Subsequent to collecting the data for this research it was imperative that we understood the significance

of the data that was collected. Data visualization techniques were therefore implemented to accomplish

that. Data visualization software allowed us to present our data in a pictorial and graphical format. It

gave us a visual representation of the data, which allowed us to grasp difficult concepts and identify

new patterns, trends and correlations that might have gone undetected in the text-based data. The software

used to visualize the data included Tibco Spotfire, Tableau, R Studio and MS Excel.

After analyzing the data we used Principal Component Analysis (PCA) to explore its

dimensions and then applied various datamining techniques. Datamining is the extraction of hidden

predictive information from large databases and was therefore the core concept upon which this

research was based. The datamining software used in this research included R Studio,

Rapid Miner, Dataiku and XLMiner.

Research Model Type

This research is of classification type because it aims to predict if a student is highly vulnerable to

failing or not. Considering the dichotomous nature of our desired classification (failing or not), our

dependent variable (Y) is binomial. (Students’ grades were transformed from numerical, i.e. 0 ~

20, into binary, i.e. 1 or 0.)

Datamining classification Performance Metrics to consider

Before choosing the performance metrics to be used in evaluating the models we need to consider that

for the purpose of our research it is more important to correctly classify students who will fail as

opposed to students who will pass. When analyzing the confusion matrix we took this into

consideration and decided that the overall accuracy was not a good measure for evaluating the

classifier. Galit Shmueli’s book ‘Data Mining for Business Intelligence’ (Shmueli, 2010) tells us that in

such a case, the following pair of accuracy measures are the most popular:

• Sensitivity, which is the classifier’s ability to detect the important class members (C1) correctly.

In our case, it is the classifier’s ability to detect students who will fail.

• Specificity, which is the classifier’s ability to rule out C0 members correctly. In our case, it is

the classifier’s ability to detect students who will pass.
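
As a minimal illustration of the two measures above, the sketch below computes sensitivity and specificity from actual and predicted classes, treating "Fail" as the important class C1. The function name and the sample labels are made up for illustration and are not taken from the study's data or software.

```python
def sensitivity_specificity(actual, predicted, positive="Fail"):
    """Return (sensitivity, specificity), treating `positive` as the important class C1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # fails detected
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # fails missed
    tn = sum(1 for a, p in pairs if a != positive and p != positive)  # passes ruled out
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # passes flagged
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative labels only (not the study's data).
actual    = ["Fail", "Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", "Pass"]
predicted = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]

sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Here three of the four failing students are detected, while two of the six passing students are flagged unnecessarily.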

In reference to the two accuracy measures mentioned above, the ROC (receiver operating

characteristic) curve will be used to evaluate the fitness of the models. The ROC curve plots the pairs

{sensitivity, 1-specificity}; hence, we will use the ROC along with the two mentioned measures to

compare the models. In this research we are more focused on correctly classifying students who will

fail; therefore, the total cost of misclassifying these students will also be considered. The confusion matrix

allows us to generate a cost matrix that can be used to calculate the total cost of misclassification for a

given model; therefore, these costs will be compared among the models that will be evaluated. Because

the dependent variable can only be 0 or 1, the model’s coefficient of determination (R2) is not an

informative measure of fit; hence, R2 will not be used as a metric for evaluating the fitness of our models (Romero et al., 2008).

A vital step in most classification algorithms is to estimate the probability that a case belongs to each of

the classes. The estimated probability of belonging to the class of interest is then compared to a cutoff value

(i.e., if the probability of failing > cutoff value, student = Fail, else student = Pass). A cutoff value of 0.5

suggests that it is equally important to correctly classify both classes; however, for the purpose of

this research it is more important to classify one class (students who will fail) than the other. Our aim

is to create a model such that its classifications of “Fail” will have a higher fitted value (i.e., higher

predicted probability of the event) compared to its classifications of “Pass”; therefore, the cutoff value

that will correctly classify more of our class of interest (Fail) is anticipated to be below

0.5. The ROC curves will be used to select and justify the optimal cutoff value for the models.
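
The effect of lowering the cutoff can be sketched as follows. The example assumes that the model outputs the probability of the event of interest (failing) and that a student is labeled "Fail" when that probability exceeds the cutoff; the probabilities are hypothetical.

```python
def classify(p_fail, cutoff):
    """Label a student 'Fail' when the predicted probability of failing exceeds the cutoff."""
    return "Fail" if p_fail > cutoff else "Pass"

# Hypothetical predicted probabilities of failing for six students.
p_fail = [0.05, 0.15, 0.25, 0.35, 0.55, 0.80]

at_half = [classify(p, 0.5) for p in p_fail]
at_fifth = [classify(p, 0.2) for p in p_fail]

print(at_half.count("Fail"), "students flagged at cutoff 0.5")
print(at_fifth.count("Fail"), "students flagged at cutoff 0.2")
```

A lower cutoff flags more students as vulnerable, which raises sensitivity at the expense of specificity.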

The drawbacks of each method will be considered as it relates to the dataset being used. This is

essential for understanding the effects of multicollinearity, important variable selection and the size of

the dataset. The behavior of the model reflects the magnitude of the effects caused by the natural

weakness of the model and the dataset. Data analysis of the dataset being used identified a great degree

of multicollinearity. We therefore implemented models such as a logistic regression using scores from

the PCA as well as Random Forest classification because they are not greatly affected by

multicollinearity. To select only the important variables, a stepwise logistic regression

was implemented.

Data Collection and Processing Described

The dataset used in this research is from two Portuguese high schools (Paulo and Silva, 2008). The data

attributes include student grades, demographic, social and school related features. It was collected by

using school reports and questionnaires. Two datasets were available regarding the performance in two

distinct subjects: Mathematics (mat) and Portuguese language (por). To increase the size of the dataset

in this research, we merged these two datasets into one and disregarded whether the grades

were from Mathematics or Portuguese. Note that the outcome/dependent variable was originally a

numeric grade between 0 and 20. This grade was converted to a pass or fail status where grades less

than 12 were labeled Fail and grades of 12 or greater were labeled Pass. This criterion was chosen based

on the grading criteria used in Belize. The following is an outline of the data attribute information:

Attribute Information

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 -

secondary education or 4 - higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 -

secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police),

‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police),

‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or

‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4

- >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 famsup - family educational support (binary: yes or no)

16 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

17 activities - extra-curricular activities (binary: yes or no)

18 nursery - attended nursery school (binary: yes or no)

19 higher - wants to take higher education (binary: yes or no)

20 internet - Internet access at home (binary: yes or no)

21 romantic - with a romantic relationship (binary: yes or no)

22 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

23 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

24 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

25 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

26 health - current health status (numeric: from 1 - very bad to 5 - very good)

27 G1 - first period grade (numeric: from 0 to 20)

In reference to the outcome variable (grade), there were three grades to consider using: G1 – first

period grade (numeric: from 0 to 20), G2 – second period grade (numeric: from 0 to 20), or G3 – final

grade (numeric: from 0 to 20, output target). Attribute G3 is the easiest to predict because it has a

strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at

the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. Nevertheless, our research

aims to classify students without using historical data on their academic performance; therefore, this

research uses G1 as its outcome/dependent variable.

Considering the aim of this research, the following historical data were not used in any of the models:

absence for the class, previous failures of the class, school support, school, Grade2 and Grade3. The

reason for not including those variables is that those values will not be available at the time of

prediction, i.e, when a new student comes in.
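
The processing steps described in this section (merging the two subject files, converting G1 to a pass/fail status at a threshold of 12, and dropping the attributes that are unavailable at prediction time) can be sketched as below. This is only an illustration: the records are made up, the column names are assumed to follow the public dataset's naming, and coding Fail as 1 is an assumption, since the text does not state which class is coded 1.

```python
PASS_MARK = 12  # grades below 12 are labeled Fail, as stated above

# Attributes left out because they are unknown when a new student arrives.
# These column names are assumed to follow the public dataset's naming.
EXCLUDED = {"absences", "failures", "schoolsup", "school", "G2", "G3"}

def prepare(records):
    """Drop excluded attributes and replace G1 with a binary fail indicator."""
    prepared = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in EXCLUDED and k != "G1"}
        row["fail"] = 1 if record["G1"] < PASS_MARK else 0  # coding 1 = Fail is an assumption
        prepared.append(row)
    return prepared

# Two tiny made-up tables standing in for the Mathematics and Portuguese files.
mat = [{"school": "GP", "Medu": 4, "higher": "yes", "absences": 2,
        "failures": 0, "schoolsup": "no", "G1": 14, "G2": 13, "G3": 13}]
por = [{"school": "MS", "Medu": 1, "higher": "no", "absences": 6,
        "failures": 1, "schoolsup": "yes", "G1": 9, "G2": 10, "G3": 10}]

students = prepare(mat + por)  # merged without recording the subject
print(students)
```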

Data Visualizations and Analysis

Data Description

Considering the nature of our data, heat maps were used for data visualization. A heat map is a literal

way of visualizing a table of numbers, where you substitute the numbers with colored cells. The shades

of colors correspond to the level of the measurements.

The heat map in Figure 1: Heatmap of Medu Vs Higher shows that a higher number of students plan

to attend higher education where the mother’s education is high. Figure 2: Heatmap of Medu Vs G1

shows that there is a higher number of students passing where the mother’s education is high. Figure 3:

Heatmap of Higher Vs G1 shows that there is a higher number of students passing where there is the

desire to pursue higher education.

Correlations

Subsequent to visualizing the data, we analyzed the relationship among the variables. Many data sets

contain highly correlated variables that measure the same kind of information in different ways. This is

referred to as multicollinearity. As mentioned earlier, some algorithms will build unstable models in

the presence of multicollinearity. It is therefore important to identify and if possible remove highly

(linearly) correlated variables. The correlation matrix allows us to do this. In addition to identifying

variables that are related to each other, the correlation matrix also allows us to see the polarity of the

relationships, i.e. whether the variables are inversely related (negative values) or directly related (positive values).

The following independent variables have a correlation of at least 0.15 in magnitude with the dependent

variable: Medu: 0.19, Fedu: 0.18, Mjob: 0.15, Study: 0.20 and Higher: 0.29. These are thus the

variables that have the strongest relationship with the students’ grade. To check for signs of

multicollinearity we also looked at the correlations among the independent variables themselves. There

is a very high correlation between the mother’s and father’s education (0.64), so we chose one

(mother’s education) and looked at its correlation with the other independent variables. The following

variables have a correlation of at least 0.15 in magnitude with mother’s education: Address: 0.18,

Fedu: 0.64, Mjob: 0.63, Fjob: 0.28, Travel time: -0.24, Higher: 0.21, Internet: 0.25. Remember that the

student’s willingness to pursue higher education is also highly correlated with G1; therefore, we

looked at its correlation with the other independent variables.

The following variables have a correlation of at least 0.15 in magnitude with the variable Higher: Age:

-0.23, Fedu: 0.19, Mjob: 0.18, Study time: 0.19, Medu: 0.21. Other pairs of independent variables with a

correlation of at least 0.15 in magnitude are: Travel time & Address: -0.34, Study time & Age: 0.24, Study time &

Walc: -0.23.
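
The screening of correlations against a 0.15 threshold can be illustrated with the short sketch below. It uses the Pearson correlation coefficient on made-up values; the function names and the data are hypothetical, and the study itself may have computed its correlation matrix with different software.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def strong_correlations(columns, target, threshold=0.15):
    """Return {name: r} for every column whose |r| with the target reaches the threshold."""
    result = {}
    for name, values in columns.items():
        if name == target:
            continue
        r = pearson(values, columns[target])
        if abs(r) >= threshold:
            result[name] = r
    return result

# Made-up values for six students (not the study's data).
data = {
    "Medu":   [4, 3, 1, 2, 4, 0],
    "health": [3, 3, 5, 3, 3, 1],
    "G1":     [15, 12, 8, 10, 16, 7],
}

selected = strong_correlations(data, "G1")
print(selected)
```

In this toy example only Medu passes the threshold; health is only weakly related to G1 and is screened out.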

Exploration through PCA

The correlations mentioned above make it evident that multicollinearity exists within our dataset.

Considering this, the accuracy and reliability of our classification are expected to suffer if we include those

highly correlated variables. In order to understand the dimensionality and address this concern we performed a

Principal Component Analysis (PCA).

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to

convert a set of observations of possibly correlated variables into a set of values of linearly

uncorrelated variables called principal components. This transformation is defined in such a way that

the first principal component has the largest possible variance (that is, accounts for as much of the

variability in the data as possible), and each succeeding component in turn has the highest variance

possible under the constraint that it is orthogonal to the preceding components. The resulting vectors

are an uncorrelated orthogonal basis set.
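
The definition above can be made concrete with a small sketch. It standardizes the variables, builds their correlation matrix, and extracts the components one at a time with power iteration and deflation. This is a simple stand-in chosen for readability, not the routine used by the software in this research, and the data values are made up.

```python
from math import sqrt

def standardize(rows):
    """Center each column and scale it to unit variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    sds = [sqrt(sum((v - mu) ** 2 for v in c) / n) for c, mu in zip(columns, means)]
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, sds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, m = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / n for j in range(m)] for i in range(m)]

def dominant_component(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(m) for j in range(m))
    return eigenvalue, v

def principal_components(rows, k):
    """Return the first k (eigenvalue, loading vector) pairs, largest variance first."""
    c = correlation_matrix(standardize(rows))
    m = len(c)
    components = []
    for _ in range(k):
        value, vector = dominant_component(c)
        components.append((value, vector))
        # Deflation removes the variance already captured, so the next
        # component found is orthogonal to the previous ones.
        c = [[c[i][j] - value * vector[i] * vector[j] for j in range(m)] for i in range(m)]
    return components

def pc_scores(rows, vector):
    """Project each standardized observation onto one component."""
    return [sum(a * b for a, b in zip(row, vector)) for row in standardize(rows)]

# Made-up values for Medu, Fedu and studytime (six students).
rows = [[4, 4, 2], [3, 2, 1], [1, 1, 3], [2, 2, 2], [4, 3, 1], [0, 1, 3]]
components = principal_components(rows, 3)
eigenvalues = [value for value, _ in components]
pc1_scores = pc_scores(rows, components[0][1])
print([round(v, 3) for v in eigenvalues])
```

The eigenvalues come out in decreasing order and sum to the number of variables, and the scores on each component are the values that can later serve as uncorrelated predictors.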

In datamining, variance among the independent variables means those variables can ‘tell a great story’,

i.e. there is valuable information within those variables. A PCA outputs various metrics such as the SS

Loadings, which quantify the variance explained by each component. By convention, PC’s with SS Loadings greater than 1 are

considered significant. Our initial PCA identified 10 PC’s with SS Loadings greater than 1; however,

further analysis using a scree plot identified only five significant PC’s. Looking at Table 1: Variance of

PC’s, we can see the SS Loadings (loadings >1) for the five PC’s that the scree plot identified.

The scree plot in Figure 4: Scree Plot was used to further analyze the variance among the PC’s. A

scree plot displays the eigenvalues associated with a component or factor in descending order versus

the number of the component or factor. Contrary to the SS Loadings of the 10 PC’s, the scree plot in

Figure 4: Scree Plot shows that 5 of the PC’s explain most of the variability. This is evident from the

line that starts to straighten after factor 5. The remaining factors explain a very small proportion of the

variability and are likely unimportant.

A parallel analysis was also done to further evaluate the variance within the PC’s. The parallel analysis

as shown in Figure 5: Parallel Analysis introduces a set of PC’s that was derived from a random

dataset. When the eigenvalues from the random data are larger than the eigenvalues from the original

PCA or factor analysis, we can consider the components or factors as mostly random noise. Looking at

the parallel analysis we can see that only three PC’s are clearly above the red line that identifies the optimal

coordinates.

The values in the Table 2: Variance of Variables describe the amount of variance that is explained by

each variable with respect to the three PC’s. PC1 is mostly defined by: Mother’s Education (0.8), Father’s

Education (0.72), Mother’s Job (0.71), Father’s Job (0.45), Travel Time (-0.41), Willingness to pursue higher

education (0.43), Internet access at home (0.40), Grade 1 (0.31). Looking at the variables in PC1 we can see that

they are more related to the parents, socioeconomic factors and factors that affect First Generation Students.

PC2 is mostly defined by: Sex (-0.62), Study time (-0.49), Free time (0.52), Go out (0.55), Walc (0.66). Looking

at the variables in PC2 we can see that they are more related to factors that are influenced by the

student as an individual. PC3 is mostly defined by: Age (0.56), Guardian (-0.61).

Further analysis of the data was done by plotting the PCA loadings of PC1 and PC2. Looking at the plot

of the PCA loadings in Figure 6: PCA Loadings Plot we can see what look somewhat like clusters of

variables. The variables that appear to be grouped together are plotted as such because, as the

correlation matrix shows, they are highly correlated. Note that “higher” is the closest independent

variable to the dependent variable (G1). The plot shows that there is not much variance within the data;

therefore, there is a high level of collinearity among the variables.

A final analysis of the PC’s was done by plotting PC1.scores against PC2.scores (colored by pass/fail).

In the plot shown in Figure 7: PCA Scores Plot, each point represents a student in the

dataset. The color red indicates a student who failed and the color blue indicates a student who passed.

Looking at the graph we can see that most students who failed appear to be in the first quadrant of the

graph while most of the students who passed are in the fourth quadrant of the graph.

Datamining Models

The datamining models for this research were limited to models that cater for the classification of

binary/binomial type dependent variables; therefore, the following classification methods were applied

and evaluated:

o logistic regression with pc-scores (with and without oversampling)

o logistic regression with step-wise implementation

o classification Tree

o random-forest (classification tree)

Note: Because the ratio of pass to fail within the dataset was 75/25, oversampling was considered

where the training set was adjusted such that there was an equal number of passes and failures.
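
One simple way to obtain such a balanced training set is random oversampling, sketched below. The text does not state which oversampling procedure was used, so duplicating randomly chosen minority-class records is an assumption made here for illustration; the function name, the `fail` field and the records are hypothetical.

```python
import random

def oversample(records, label_key="fail", seed=0):
    """Randomly duplicate minority-class records until both classes are equally represented."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Made-up training set with a 75/25 pass-to-fail ratio.
training = [{"id": i, "fail": 1 if i < 25 else 0} for i in range(100)]
balanced = oversample(training)

fails = sum(r["fail"] for r in balanced)
print(len(balanced), "records,", fails, "fails")
```

Only the training set is rebalanced in this way; the test set keeps its original class ratio so that the evaluation reflects the real population.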

Drawbacks of the Models to be used

General Drawbacks of Logistic Regression:

Many datasets often contain variables that for various reasons don’t provide much information. In a

logistic regression only the meaningful variables should be included; however, it is important that all

meaningful variables are included. Logistic regression models should have little or no multicollinearity

because if the independent variables are related to one another, then the model will tend to overweight the

significance of those variables. Because the parameter estimation procedure of a logistic regression

relies heavily on having an adequate number of samples for each combination of independent variables,

small sample sizes can lead to wildly inaccurate estimates of parameters.

General Drawbacks of Decision Trees:

Some of the drawbacks of decision trees are related to the problem of multicollinearity. When two variables both explain the

same thing, a decision tree will greedily choose the best one, whereas many other methods will use

them both. Ensemble methods such as random forests can negate this to a certain extent, but you lose

the ease of understanding (Breiman et al., 1984).

General Drawbacks of Random Forest:

The main limitation of the Random Forests algorithm is that a large number of trees may make the

algorithm slow for real-time prediction. In addition, very highly correlated independent variables have a

slightly negative effect on the model (Breiman, 2001).

Analysis of Results

Misclassification Costs

When evaluating our models we need to consider the costs that are introduced with the

misclassification of each class. In our study, misclassifying actual “fail” students as

“pass” students is much costlier than misclassifying actual “pass” students as “fail” students.

Cost-sensitive measures usually assume that the costs of making an error are known. That is, one has a

cost matrix, which defines the costs incurred by false positives and false negatives. Each example, x, can be

associated with a cost C(i, j, x), which defines the cost of predicting class i for x when the "true" class

is j. The goal is to take a decision to minimize the expected cost (Batista, Prati, and Monard, 2004). If a

student is classified as “pass” but actually fails, he/she incurs the cost of paying for an

extra year of tuition fees and additional lunch fees. Conversely, if the student is classified as

“fail” but actually passes, the school incurs the unnecessary cost of providing remedial

classes and mentoring. Suppose that tuition fees cost $700 for the year and that lunch costs add up to

$1000 for the year. The total cost of misclassifying the student as “pass” would then be $1700. Suppose also

that the cost for the school to provide remedial classes and mentoring is $300 per student.

The total misclassification cost was calculated as:

Total cost = (cost of misclassifying as “pass” * number of students misclassified as “pass”) + (cost of

misclassifying as “fail” * number of students misclassified as “fail”)
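
The calculation above can be written as a short function. The two unit costs are the ones assumed in the text ($1700 and $300); the misclassification counts are hypothetical and serve only to show the arithmetic.

```python
COST_MISSED_FAIL = 1700  # predicted "pass" but actually fails ($700 tuition + $1000 lunch)
COST_FALSE_ALARM = 300   # predicted "fail" but actually passes (remedial classes and mentoring)

def total_misclassification_cost(missed_fails, false_alarms):
    """Total cost given the two off-diagonal counts of the confusion matrix."""
    return COST_MISSED_FAIL * missed_fails + COST_FALSE_ALARM * false_alarms

# Hypothetical counts: 3 failing students missed, 40 passing students flagged.
cost = total_misclassification_cost(missed_fails=3, false_alarms=40)
print(cost)
```

Because a missed failing student costs more than five times as much as a false alarm, a model may flag many passing students and still have a lower total cost than a more "accurate" model that misses failing students.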

Results from Logistic Regression with PCA Scores Using R

Reflecting on the analysis of the data, multicollinearity was anticipated to be a major issue. In an

attempt to alleviate multicollinearity issues, the logistic regression using PCA scores was implemented.

Performing a PCA addressed the issue of multicollinearity because it produced PC’s that are

orthogonal to each other. The PCA above shows that the first five PC’s can be considered

significant; therefore, the scores of those five PC’s were used as the independent variables in this

logistic regression. A dataset is imbalanced if the classification categories are not approximately

equally represented. For such problems, the interest usually leans towards correct classification of the

“rare” class. Considering that within our dataset 71% of the outcome variable was a passing grade, we

did an oversampling of the training data to check if there were any issues that may be caused by

imbalanced data.
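
To make the idea of a logistic regression on PC scores concrete, the sketch below fits the model by plain gradient descent on the log-loss. Gradient descent is used here only because it is short and self-contained; it stands in for the maximum likelihood routine of the statistical software used in this research. The scores and labels are made up, only two components are used for brevity, and 1 denotes a failing student.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def predict_proba(weights, row):
    """Predicted probability of failing; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], row)))

def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    n, m = len(X), len(X[0])
    weights = [0.0] * (m + 1)
    for _ in range(epochs):
        gradient = [0.0] * (m + 1)
        for row, target in zip(X, y):
            error = predict_proba(weights, row) - target
            gradient[0] += error
            for j, value in enumerate(row):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights

# Hypothetical PC1 and PC2 scores for eight students; 1 = fail, 0 = pass.
scores = [[-2.0, 0.5], [-1.5, 1.0], [-1.0, 0.2], [-0.2, 0.8],
          [0.3, -0.5], [1.0, -0.2], [1.5, -1.0], [2.0, 0.1]]
fail = [1, 1, 1, 0, 0, 0, 0, 0]

weights = fit_logistic(scores, fail)
print(round(predict_proba(weights, [-2.0, 0.5]), 2), round(predict_proba(weights, [2.0, 0.1]), 2))
```

Because the PC scores are uncorrelated by construction, the fitted coefficients do not suffer from the instability that correlated raw variables would cause.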

The overall results on the test dataset for the model trained without oversampling were somewhat better than those for the

model trained with oversampling. The results are shown in Table 3: G

Logistic Regression, PCA Test Without Oversampling Accuracy Metrics and Table 4: H Logistic

Regression, PCA Test Without Oversampling Confusion Matrix with Misclassification Cost.

Results from Stepwise Logistic Regression Using R

As stated above, choosing only meaningful independent variables is also essential for a logistic

regression, hence the reason for implementing the step-wise approach. The step-wise logistic

regression approach builds its model by adding variables one by one (forward selection) or by

removing variables one by one (backward elimination) in an attempt to identify and keep only those

variables that produce the best results. This method, however, does not address the issue of

multicollinearity. The stepwise approach has a serious problem with multicollinearity because it

tends to choose the predictors that best match the data sample. It favors variables that have

high-magnitude coefficients in the sample, which do not necessarily hold for the underlying population. For example, if two correlated

predictors contribute equally to fitting the data sample, there is a chance that both will be left out of

the model because each is individually "less significant" than other non-correlated predictors (which

happened to have high coefficient magnitudes for the sample).

This approach identified the variables in Table 5: Logistic Regression Stepwise Coefficients as the

predictors that best match the data sample. Those are some of the variables that have a high correlation

with the dependent variable; however, the correlations mentioned above show that those variables have

a significant correlation among themselves as well as with other independent variables. Those

correlations show that other variables such as father’s education and mother’s job also had a high

correlation with the dependent variable; however, the stepwise approach did not include them in the

model. The variables chosen as the best predictors for the data sample were Family Size, Age, Internet,

Family Support, Paid, Mother’s Education, Study time, and Higher. The collinearity among those

variables caused the model to fail in identifying some of their significance. Looking at Table 5:

Logistic Regression Stepwise Coefficients we can see that although mother’s education is a highly

significant variable, it is not identified as such. The likely reason is that mother’s education fits the model

about as well as other independent variables such as ‘paid’ and ‘higher’. We can also

assume this based on domain knowledge, which is supported by research on the ‘effects of First

Generation Students’. Furthermore, the correlations also show that there is a high correlation among

those variables themselves as well as with others that are not included in the model.

Results from Decision Tree Using Rapid Miner

Similar to the stepwise logistic regression with its multicollinearity issues, the decision tree model

fails to identify some variables as significant. Although mother’s education and Father’s education are

highly correlated with the student’s grade they are not identified as significant variables in the tree.

Decision trees classify instances by sorting them down the tree from the root to some leaf node, which

provides the classification of the instance. Each node in the tree specifies a test of some attribute of the

instance, and each branch descending from that node corresponds to one of the possible values for this

attribute (Breiman et al., 1984). As mentioned in the drawbacks of a decision tree, when two variables

both explain the same thing, the decision tree will greedily choose the best one. The correlation matrix

shows that these variables were correlated with other independent variables; therefore, the model chose

the variables that it deemed the better ones. As a consequence of that the model was weakened.

This model outputs a relatively high sensitivity but a very low specificity. The very low specificity indicates that

many students who passed were misclassified. This model produced a Sensitivity of 64%,

Specificity of 35%, Accuracy of 44%, Misclassification of 56% and AUC of 28%. This contributed

heavily to a very high total misclassification cost, which showed the model to be unreliable.

Results from Random Forest Using Dataiku

As mentioned in the drawbacks of the model, the main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm slow for real-time prediction (Breiman, 2001). Given the small size of our dataset, that was not a concern. This model produced a Sensitivity of 88%, Specificity of 52% and Total Misclassification Cost of 333. The AUC is 72%. Below is the confusion matrix that was calculated. Note that the software chose 0.2 as its optimal cutoff. The model identified significant variables that are consistent with those identified by the correlation matrix and PCA output. Because the Random Forest algorithm is not heavily affected by multicollinearity, this model produced fairly good results.

ROC Curves (all models)


The Receiver Operating Characteristic (ROC) curve shows the true positive rate vs. the false positive rate resulting from different cutoffs in the predictive model. The "faster" the curve climbs, the better the model; conversely, a curve close to the diagonal line indicates a worse one.

ROC curves for backward stepwise regression and PC scores regression (plotted on the same graph) are shown in Figure 8: ROC, Logistic Regression PCA and Stepwise. The red curve represents the ROC curve for stepwise regression and the blue curve represents the ROC curve for the PC scores regression. Looking at the steepness of the line (where the line looks vertical), it can be seen that the optimal cutoff for the ROC curve is 0.2.
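How a ROC curve is traced from cutoffs can be made concrete with a small sketch. It sweeps the cutoff over a set of predicted probabilities, records the false positive rate and true positive rate at each step, and estimates the area under the curve with the trapezoidal rule. The scores and labels are hypothetical, and `roc_points` and `auc` are illustrative helpers rather than the tooling used in this study.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the cutoff over every score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities of failing (1 = fail, 0 = pass).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

curve = roc_points(scores, labels)
print("ROC points:", curve)
print("AUC:", auc(curve))
```

A curve that rises steeply near the left edge collects most true positives before many false positives accumulate, which is what "climbing faster" means in the text.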

Chosen Model

The logistic regression using the PC scores (no oversampling) produced the best model. With a cutoff of 0.2, this approach produced a sensitivity (correctly classified fail) of 93.48%, a specificity (correctly classified pass) of 60.36%, an accuracy (overall correct classification) of 70%, a misclassification rate (overall misclassification) of 30%, and the lowest total misclassification cost, 183. Although some of the other approaches produced models with a sensitivity greater than 93.48%, it is very important to also consider the specificity and the misclassification cost. For this reason, this model was evaluated to be the best model.
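The reported rates can be reproduced from the confusion matrix in Table 4. The short Python sketch below recomputes them; it is an illustration, not the authors' code. The two per-error cost weights (3 for a passing student flagged as failing, 17 for a failing student who is missed) are an assumption: they are back-solved from the totals in Table 3 (183 at cutoff 0.2, and 251 at cutoff 0.3 with the 27 false positives and 10 false negatives implied by that row's rates) and are not quoted from the text.

```python
# Confusion matrix from Table 4 (cutoff 0.2); positive = fail, negative = pass.
tn, fp = 67, 44   # actual pass: predicted pass / predicted fail
fn, tp = 3, 43    # actual fail: predicted pass / predicted fail

# ASSUMPTION: cost weights inferred from the totals in Table 3, not stated here.
COST_FALSE_POSITIVE = 3    # passing student flagged as failing
COST_FALSE_NEGATIVE = 17   # failing student not flagged

total = tn + fp + fn + tp
sensitivity = tp / (tp + fn)          # share of failing students caught
specificity = tn / (tn + fp)          # share of passing students cleared
accuracy = (tp + tn) / total
misclassification = 1 - accuracy
total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

print(f"sensitivity       = {sensitivity:.2%}")
print(f"specificity       = {specificity:.2%}")
print(f"accuracy          = {accuracy:.0%}")
print(f"misclassification = {misclassification:.0%}")
print(f"total cost        = {total_cost}")
```

Weighting a missed failing student more heavily than a false alarm is what makes a low cutoff such as 0.2 attractive despite the loss in specificity.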

Implementation (How to Deploy and Intervene)

A computerized database is needed for easy, cheap, effective and efficient long-term use of the model. Software can be designed to extract the relevant data from the database and execute the model. The model can be adopted and implemented at either the institutional level or the ministry level. Schools can individually choose to collect the necessary data, run the model, and make the necessary interventions to assist the target students. Alternatively, the ministry of education can use this tool to identify and gauge the level of additional support that it will provide to a particular school. The resources that students need to improve their academic performance are limited; therefore, implementing this model will benefit resource management.
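As a rough illustration of what such software might do, the sketch below scores students with a logistic model on principal component scores and flags those at or above the 0.2 cutoff reported for the chosen model. Everything except the cutoff is hypothetical: the intercept, the weights, the student IDs and the PC scores are invented for illustration, and a real deployment would load the fitted coefficients and compute PC scores from the stored student records.

```python
from math import exp

CUTOFF = 0.2  # cutoff reported for the chosen model

# HYPOTHETICAL coefficients for illustration only; the fitted values are not given here.
INTERCEPT = -1.0
PC_WEIGHTS = [-0.8, 0.5, 0.3]


def fail_probability(pc_scores):
    """Logistic model: predicted probability that a student fails."""
    z = INTERCEPT + sum(w * s for w, s in zip(PC_WEIGHTS, pc_scores))
    return 1 / (1 + exp(-z))


def flag_for_support(students, cutoff=CUTOFF):
    """Return the IDs of students whose fail probability meets the cutoff."""
    return [sid for sid, pcs in students.items() if fail_probability(pcs) >= cutoff]


# Hypothetical student records: ID -> scores on the first three components.
students = {
    "S001": [1.5, -0.2, 0.1],
    "S002": [-1.0, 0.8, 0.4],
    "S003": [0.0, 0.0, 0.0],
}

for sid in flag_for_support(students):
    print(sid, round(fail_probability(students[sid]), 3))
```

Keeping the cutoff as a parameter lets a school or ministry trade sensitivity against specificity according to the support resources it has available.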

The greatest challenge for implementing this tool in Belize is the unavailability of the digitized data that is needed. Most schools in Belize still use manual databases; furthermore, they do not collect much data on student demographics. There is also the concern of data privacy, which may make students reluctant to provide the required information or, in some cases, lead them to provide inaccurate information. Business intelligence and data mining require sufficient and accurate data; therefore, this needs to be taken into consideration.

Challenges

The greatest challenge when designing this model was the availability of data that relates to student demographics and academic performance. It was not feasible to manually collect the desired data because of the limited time (3 months) I had to do the research. Data privacy would also have been an issue if I had tried to collect the data manually. These issues were among the reasons why I could not find a larger dataset. As outlined in the drawbacks of the models, the size of the dataset can affect their performance; hence there is room for improving the models.

Future Work

Many of the independent variables used in the model have multiple levels. For example, ‘mother’s job’ has 5 levels: 0 = at home, 1 = other, 2 = service department, 3 = health, 4 = teacher. Future work should modify the model so that it can identify the varying effects at different levels of the individual variables. This will allow for a more detailed classification and a more customized application of interventions for vulnerable students.
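One common way to let a model estimate a separate effect for each level is indicator (dummy) coding. The sketch below expands the five ‘mother’s job’ codes listed above into 0/1 columns; it is only an illustration of the idea. The column names, the `dummy_code` helper and the choice of "at home" as the omitted baseline level are assumptions.

```python
# Level codes for 'mother's job' as listed in the text.
MJOB_LEVELS = {0: "at_home", 1: "other", 2: "service", 3: "health", 4: "teacher"}


def dummy_code(value, levels=MJOB_LEVELS, baseline=0):
    """Expand one categorical code into 0/1 indicators, omitting the baseline level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {
        f"Mjob_{name}": int(value == code)
        for code, name in levels.items()
        if code != baseline
    }


teacher_row = dummy_code(4)   # mother works as a teacher
at_home_row = dummy_code(0)   # baseline level: every indicator is 0

print(teacher_row)
print(at_home_row)
```

Each indicator then receives its own coefficient, interpreted relative to the baseline level, so the effect of one occupation can differ from another rather than being forced onto a single numeric scale.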

Conclusion

Although it is not feasible to change the factors that relate to the parents, policy makers or the school can fill the gap by providing alternative forms of teaching methodology and assessment to the students identified as vulnerable to failing. Remedial classes can also be offered. Such measures can increase study time, compensate for limited family support, offer counselling to strengthen students’ desire to pursue higher education, and provide resources such as internet access that might not be available at home.

This research aimed to analyze various student demographics in an effort to classify students as likely to fail or pass; no historical data (e.g., previous grades, previous failures, previous absences) was used. This proactive approach, which does not rely on data about students’ previous performance, is what differentiates this research from most related studies. The model can be readily adopted by any institution regardless of the availability of data on the students’ previous performance.

Some models produced a very high true classification rate for the students who will fail and a very low true classification rate for students who will pass; this means that many of the students classified as failing are students who will actually pass. Although the cost of this type of misclassification is lower than that of the reverse error, it still had a great effect on the total misclassification cost. Correctly identifying highly vulnerable students and improving their educational achievement would be evidence of successful policy design and implementation derived from the solution of this project.


References

Breiman L. “Random Forests”, Machine Learning, 45(1), pp. 5–32, 2001.

Breiman L., J. Friedman, R. Olshen, and C. Stone “Classification and Regression Trees”, 1984.

Batista G. E. A. P. A., R. C. Prati, and M. C. Monard “A study of the behavior of several methods for balancing machine learning training data”, SIGKDD Explorations, 6(1), 2004.

Cushman K. “First in the family: Your college years. Advice about college from first generation students”, Next Generation Press, 2006.

Minaei-Bidgoli “Using datamining to predict secondary school student performance”, 2003.

Paulo C. and D. Silva “Using Datamining to Predict Secondary School Student Performance”, 2008.

Pekka K. and K. Nissinen “Background factors behind mathematics achievement in Finnish education context: Explanatory models based on TIMSS 1999 and TIMSS 2011 data”, 2011.

REUTERS/Keith Bedford “The challenges of a first generation college student are tough, but not impossible”, 2015.

Romero C., S. Ventura, P. Espejo, and C. Hervás “Data mining algorithms to classify students”, 1st International Conference on Educational Data Mining, June 2008.

Shmueli G. “Data Mining for Business Intelligence”, 2nd ed., Chpt. 5, p. 101, 2010.

Figure 1. Heat Map of Medu vs Higher

Figure 2. Heat Map of Medu vs G1

Figure 3. Heat Map of Higher vs G1

Figure 4. Scree Plot

Figure 5. Parallel Analysis

Figure 6. PCA Loadings Plot

Figure 7. PCA Scores Plot

Figure 8. ROC Logistic Regression PCA and Stepwise

Table 1. Variance of PCs

               PC1    PC2    PC3    PC4    PC5
SS loadings   3.09   2.07   1.54    1.4   1.31

Table 2. Variance of Variables

Variable       PC1     PC2     PC3
sex          -0.08   -0.62    0.18
binAge       -0.24     0.1    0.56
address       0.35   -0.01    0.23
famsize       0.01    -0.1   -0.15
Pstatus       0.03    -0.1    0.32
Medu           0.8    0.13    0.12
Fedu          0.72    0.13   -0.05
Mjob          0.71    0.24    0.07
Fjob          0.45    0.07   -0.22
reason        0.28   -0.17    0.18
guardian      0.16   -0.03   -0.61
traveltime   -0.41    0.08   -0.22
studytime     0.23   -0.49    0.16
famsupport    0.25   -0.17    0.07
paid          0.26   -0.03     0.2
activities    0.19    0.17   -0.03
nursery       0.19   -0.07    0.07
higher        0.43   -0.26   -0.25
internet       0.4    0.16    0.26
romantic     -0.09   -0.05    0.38
famrel        0.05    0.11   -0.07
freetime     -0.02    0.52    0.08
goout        -0.02    0.55    0.19
walc         -0.08    0.66   -0.01
health        0.02    0.25   -0.22
G1           -0.38    0.22    0.18


Table 3. Logistic Regression, PCA Test Without Oversampling Accuracy Metrics

Test (No Oversampling)

Cutoff  Sensitivity  Specificity  Accuracy  Misclassification   AUC  Total Cost
0.5          56.52%       92.79%       82%                18%  -36%
0.4          67.39%       86.49%       81%                19%  -19%
0.3          78.26%       75.67%       76%                24%    3%         251
0.2          93.48%       60.36%       70%                30%   33%         183

Table 4. Logistic Regression, PCA Test Without Oversampling Confusion Matrix with Misclassification Cost

Confusion matrix for cutoff of 0.2 (test - no oversampling)

                  Predicted Negative  Predicted Positive  Total Cost
Actual Negative                   67                  44         183
Actual Positive                    3                  43


Table 5. Logistic Regression Stepwise Coefficients

              Estimate  Std. Error  z value  Pr(>|z|)  Significance
(Intercept)   -0.65274     1.25717   -0.519  0.603613
binAge1        0.44293     0.18702    2.386  0.017867  *
binAge2       -0.03441     0.66696   -0.052  0.958851
famsize1       0.34256     0.18454    1.856  0.063411
Medu1          1.83682     1.23825    1.483  0.137968
Medu2          1.37898      1.2366    1.115  0.264791
Medu3          1.32442     1.23941    1.069  0.285258
Medu4          0.64559     1.24184     0.52  0.603159
studytime2    -0.53506     0.18463   -2.898  0.003755  **
studytime3    -1.55334     0.30479   -5.096  3.46E-07  ***
studytime4    -1.42009     0.42916   -3.309  0.000936  ***
famsup1        0.42009     0.17824    2.357  0.018431  *
paid1          0.71621     0.20131    3.558  0.000374  ***
higher1       -1.62078       0.293   -5.532  3.17E-08  ***
internet1     -0.43206     0.19769   -2.186  0.028845  *
