Post on 27-Jun-2020
Classification for Early Detection of High School Students Vulnerable to Poor
Academic Performance
Jevon M. Mckenzie1, Soumya Ray2*
1. National Tsing Hua University Department of Technology Management, 2. National Tsing Hua University Institute of
Service Science
Taiwan, R.O.C.
*: Corresponding Author. Email: soumya.ray@iss.nthu.edu.tw. Fax: 886-3-561 0141
ABSTRACT
This paper analyzes student demographic information in order to
provide educational institutions with meaningful high-level information on their students. This information
will allow institutions to intervene proactively by identifying high school students who are highly
vulnerable to failing and providing them with additional support.
The models in this research identified two groups of factors that are significant determinants of a
student’s academic performance: (1) factors that relate to the parents (parents’ education, resources
available at home, paid extra classes, address, student willingness to pursue higher education), and (2)
factors that relate to the student as an individual (study time, free time, alcohol consumption). Four
datamining models were applied: Logistic Regression using PCA scores, Logistic Regression using a
stepwise implementation, Decision Tree, and Random Forest. PCA allowed us to identify orthogonal
components that were used as the independent variables in the logistic regression model. This method
produced the best results.
Key words: Student Academic Performance, Datamining, Intervention
Introduction
Throughout the world, student achievement is often associated with the future economic power and
competitiveness of a country (Pekka and Nissinen, 2011). Although this view has been
acknowledged and has stimulated numerous related studies in the past, it is inevitable that the
academic performance of each generation of students will be hindered by different demographic
factors. Understanding this, we must strive to create and maintain a learning environment that is
conducive to improving the academic performance of our students. In order to achieve this, there is
much need to understand and identify significant and consistent factors that relate to students’
academic performance.
Nowadays the new trend is to use automated tools to analyze raw student data and extract interesting
information. Considering the multiple sources of data (e.g. traditional databases, online web pages) and
diverse interest groups (e.g. students, teachers, administrators or alumni), the education sector offers a
fruitful ground for Business Intelligence applications (Minaei-Bidgoli, 2003). Since it can help to
achieve a better understanding of this phenomenon and ultimately improve it, modeling student
performance is an important tool for both educators and students. For instance, school professionals
could perform corrective measures for weak students (e.g. remedial classes).
This research is driven by datamining techniques that are oriented towards prediction. Belize has not
yet focused on Business Intelligence and Datamining approaches to evaluate its education system;
therefore, the availability of demographic information on its student population is almost non-existent.
In order for this research to be applicable to Belize, when choosing the dataset we considered some
general characteristics of the Belizean student population and tried to find a student dataset from a
country that has a student population with similar characteristics. Belize is a very young and
developing country; therefore, most of its current generation of students are First Generation students.
These students are those whose parents did not attend college, hence the majority of them lack the
resources, moral support, and motivation to strive for excellence and pursue a higher education
(Cushman, 2006). In these families, role assignments about work, family, religion and community are
passed down through the generations creating “intergenerational continuity.” When a family member
disrupts this system by choosing to attend college, he or she experiences a shift in identity, leading to a
sense of loss. Not prepared for this loss, many first-generation students may come to develop two
different identities—one for home and another for college (Bedford, 2006).
Considering the similarity between Belize and Portugal as it relates to the low percentage of adults with
postsecondary education, the dataset used in this research contains various demographic information on
students from two Portuguese high schools. The dataset comes from the UCI Machine
Learning Repository (Paulo and Silva, 2008). In alignment with the research on educational challenges
of “First Generation Students”, correlation of the variables within the dataset in this research identified
that students’ willingness to pursue higher education is the factor that has the highest correlation with
the students’ academic performance. Other data analyses such as heat maps and PCA identified other
significant factors that were consistent with those that define First Generation students. Ultimately,
although some First Generation students overcome the many adversities (identified by the significant
factors) that they are faced with, most of them do not. Consequently, it is imperative that we
identify and classify those students who are exposed to the various challenges described above.
The Proposed Approach and Methodology
Software to be used
Subsequent to collecting the data for this research it was imperative that we understood the significance
of the data that was collected. Data visualization techniques were therefore implemented to accomplish
that. Data visualization software allowed us to present our data in a pictorial and graphical format. It
gave us a visual representation of the data, which allowed us to grasp difficult concepts and identify
new patterns, trends and correlations that might have gone undetected in the text-based data. The software
used to visualize the data included Tibco Spotfire, Tableau, R Studio and MS Excel.
After analyzing the data we used Principal Component Analysis (PCA) to explore its
dimensions and then applied various datamining techniques. Datamining is the extraction of hidden
predictive information from large databases and was therefore the core concept upon which this
research was based. The datamining software used in this research included R Studio,
Rapid Miner, Dataiku and XLMiner.
Research Model Type
This research is of classification type because it aims to predict if a student is highly vulnerable to
failing or not. Considering the dichotomous nature of our desired classification (failing or not), our
dependent variable (Y) is of type binomial. (Students’ grades were transformed from numerical, i.e. 0 to
20, into binary, i.e. 1 or 0.)
Datamining classification Performance Metrics to consider
Before choosing the performance metrics to be used in evaluating the models we need to consider that
for the purpose of our research it is more important to correctly classify students who will fail as
opposed to students who will pass. When analyzing the confusion matrix we took this into
consideration and decided that the overall accuracy was not a good measure for evaluating the
classifier. Galit Shmueli’s book ‘Data Mining for Business Intelligence’ (Shmueli, 2010) tells us that in
such a case, the following pair of accuracy measures are the most popular:
• Sensitivity, which is the classifier’s ability to detect the important class members (C1) correctly.
In our case, it is the classifier’s ability to detect students who will fail.
• Specificity, which is the classifier’s ability to rule out C0 members correctly. In our case, it is
the classifier’s ability to detect students who will pass.
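The two measures above can be computed directly from the counts in a confusion matrix. The following is a minimal sketch, assuming "Fail" is the important class (C1); the function name and the ten example labels are hypothetical and are not taken from the study's data.

```python
def sensitivity_specificity(actual, predicted, positive="Fail"):
    """Return (sensitivity, specificity), treating `positive` as the important class C1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # fails detected
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # fails missed
    tn = sum(1 for a, p in pairs if a != positive and p != positive)  # passes ruled out
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # passes flagged as fail
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for ten students (illustration only).
actual = ["Fail", "Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", "Pass"]
predicted = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]

sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy example three of the four failing students are detected and four of the six passing students are ruled out.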
In reference to the two accuracy measures mentioned above, the ROC (receiver operating
characteristic) curve will be used to evaluate the fitness of the models. The ROC curve plots the pairs
{sensitivity, 1-specificity}; hence, we will use the ROC along with the two mentioned measures to
compare the models. In this research we are more focused on correctly classifying students who will
fail; therefore, the total cost of misclassifying these students will also be considered. The confusion matrix
allows us to generate a cost matrix that can be used to calculate the total cost of misclassification for a
given model; therefore, these costs will be compared among the models that will be evaluated. Because
the dependent variable can only be a 0 or 1, the model’s coefficient of determination (R2) will always be
Hence, R2 will not be considered a metric for evaluating the fitness of our model (Romero et al., 2008).
A vital step in most classification algorithms is to estimate the probability that a case belongs to each of
the classes. The estimated probability of belonging to the class of interest will be compared to a cutoff value
(i.e., if the probability of failing > cutoff value, student = Fail, else student = Pass). A cutoff value of 0.5
suggests that it is equally important to correctly classify both classes; however, for the purpose of
this research it is more important to correctly classify one class (students who will fail) than the other. Our aim
is to create a model such that its classifications of “Fail” will have a higher fitted value (i.e., higher
predicted probability of the event) compared to its classifications of “Pass”; therefore, the cutoff value
that will accurately classify more of our success class is anticipated to be a cutoff value that is below
0.5. The ROC curves will be used to select and justify the optimal cutoff value for the models.
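The effect of lowering the cutoff can be illustrated with a small sketch. It assumes that the model outputs the estimated probability of failing and that a student is classified as "Fail" when that probability exceeds the cutoff; the probabilities and labels below are hypothetical.

```python
def classify(prob_fail, cutoff):
    """Label a student "Fail" when the estimated probability of failing exceeds the cutoff."""
    return ["Fail" if p > cutoff else "Pass" for p in prob_fail]


def sensitivity(actual, predicted):
    """Share of the actual "Fail" students that were classified as "Fail"."""
    fails = [p for a, p in zip(actual, predicted) if a == "Fail"]
    return sum(1 for p in fails if p == "Fail") / len(fails)


# Hypothetical estimated probabilities of failing for seven students.
prob_fail = [0.65, 0.35, 0.25, 0.10, 0.45, 0.15, 0.05]
actual = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass"]

for cutoff in (0.5, 0.2):
    predicted = classify(prob_fail, cutoff)
    print(cutoff, predicted, round(sensitivity(actual, predicted), 2))
```

With these numbers the cutoff of 0.5 detects only one of the three failing students, while the cutoff of 0.2 detects all three at the price of flagging one passing student. This is the trade-off that the ROC curve makes visible.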
The drawbacks of each method will be considered as it relates to the dataset being used. This is
essential for understanding the effects of multicollinearity, important variable selection and the size of
the dataset. The behavior of the model reflects the magnitude of the effects caused by the natural
weakness of the model and the dataset. Analysis of the dataset being used identified a great degree
of multicollinearity. We therefore implemented a logistic regression using scores from
the PCA as well as a Random Forest classification because they are not greatly affected by
multicollinearity. To select only the important variables, a stepwise logistic regression
was implemented.
Data Collection and Processing Described
The dataset used in this research is from two Portuguese high schools (Paulo and Silva, 2008). The data
attributes include student grades, demographic, social and school related features. It was collected by
using school reports and questionnaires. Two datasets were available regarding the performance in two
distinct subjects: Mathematics (mat) and Portuguese language (por). To increase the size of the dataset
in this research, we merged these two datasets into one and disregarded whether the grades
were from Mathematics or Portuguese. Note that the outcome/dependent variable was originally a
numeric grade between 0 and 20. This grade was converted to a pass or fail status where grades less
than 12 were labeled Fail and grades of 12 or greater were labeled Pass. This criterion was chosen based
on the grading criteria used in Belize. The following is an outline of the data attribute information:
Attribute Information
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 –
secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 –
secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police),
‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police),
‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or
‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4
- >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 famsup - family educational support (binary: yes or no)
16 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
17 activities - extra-curricular activities (binary: yes or no)
18 nursery - attended nursery school (binary: yes or no)
19 higher - wants to take higher education (binary: yes or no)
20 internet - Internet access at home (binary: yes or no)
21 romantic - with a romantic relationship (binary: yes or no)
22 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
23 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
24 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
25 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
26 health - current health status (numeric: from 1 - very bad to 5 - very good)
27 G1 - first period grade (numeric: from 0 to 20)
In reference to the outcome variable (grade), there were three grades to consider using: G1 – first
period grade (numeric: from 0 to 20), G2 – second period grade (numeric: from 0 to 20), or G3 – final
grade (numeric: from 0 to 20, output target). Attribute G3 is the easiest to predict because it has a
strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at
the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. Nevertheless, our research
aims to classify students without using historical data on their academic performance; therefore, this
research uses G1 as its outcome/dependent variable.
Considering the aim of this research, the following historical data were not used in any of the models:
absences from the class, previous failures of the class, school support, school, Grade2 and Grade3. These
variables were excluded because their values will not be available at the time of
prediction, i.e., when a new student comes in.
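The preparation steps described in this section (merging the two subject datasets, excluding the historical variables, and converting G1 into a pass/fail label) can be sketched as follows. The column names follow the public UCI files and should be treated as assumptions, as should the two example rows; this is an illustration and not the processing script used in the study.

```python
PASS_MARK = 12  # grades below 12 are labeled Fail, grades of 12 or more are labeled Pass

# Assumed column names for the excluded variables (absences, previous failures,
# school support, school, second period grade and final grade).
EXCLUDED = {"absences", "failures", "schoolsup", "school", "G2", "G3"}


def prepare(records):
    """Return (features, label) pairs with the excluded columns dropped and G1 binarized."""
    prepared = []
    for row in records:
        features = {k: v for k, v in row.items() if k not in EXCLUDED and k != "G1"}
        label = "Fail" if row["G1"] < PASS_MARK else "Pass"
        prepared.append((features, label))
    return prepared


# Hypothetical rows standing in for the Mathematics and Portuguese files.
math_rows = [{"school": "GP", "Medu": 4, "higher": "yes", "absences": 6,
              "failures": 0, "schoolsup": "no", "G1": 11, "G2": 10, "G3": 10}]
por_rows = [{"school": "MS", "Medu": 2, "higher": "no", "absences": 2,
             "failures": 1, "schoolsup": "yes", "G1": 12, "G2": 13, "G3": 13}]

merged = math_rows + por_rows  # the subject of each row is deliberately not recorded
data = prepare(merged)
print(data)
```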
Data Visualizations and Analysis
Data Description
Considering the nature of our data, heat maps were used for data visualization. A heat map is a literal
way of visualizing a table of numbers, where you substitute the numbers with colored cells. The shades
of colors correspond to the level of the measurements.
The heat map in Figure 1: Heatmap of Medu Vs Higher shows that more students plan
to pursue higher education where the mother’s education is high. Figure 2: Heatmap of Medu Vs G1
shows that more students pass where the mother’s education is high. Figure 3:
Heatmap of Higher Vs G1 shows that more students pass where there is a
desire to pursue higher education.
Correlations
Subsequent to visualizing the data, we analyzed the relationship among the variables. Many data sets
contain highly correlated variables that measure the same kind of information in different ways. This is
referred to as multicollinearity. As mentioned earlier, some algorithms will build unstable models in
the presence of multicollinearity. It is therefore important to identify and if possible remove highly
(linearly) correlated variables. The correlation matrix allows us to do this. In addition to identifying
variables that are related to each other, the correlation matrix also allows us to see the polarity of the
relationships, i.e. whether the variables are directly related (positive values) or inversely related (negative values).
The following are the correlations (0.15 or greater in magnitude) between the independent variables and the dependent
variable: Medu: 0.19, Fedu: 0.18, Mjob: 0.15, Study time: 0.20 and Higher: 0.29. These are the
variables that have the strongest relationship with the students’ grade. To check for signs of
multicollinearity we also looked at the correlations among the independent variables themselves. There
is a very high correlation between the mother’s and father’s education (0.64), so we chose one
(mother’s education) and looked at its correlation with the other independent variables. The following are
the correlations (0.15 or greater in magnitude) between the other independent variables and mother’s education: Address: 0.18,
Fedu: 0.64, Mjob: 0.63, Fjob: 0.28, Travel time: -0.24, Higher: 0.21, Internet: 0.25. Recall that the
student’s willingness to pursue higher education is also highly correlated with Grade 1; therefore, we
also looked at its correlation with the other independent variables.
The following are the correlations (0.15 or greater in magnitude) between the other independent variables and the variable Higher: Age:
-0.23, Fedu: 0.19, Mjob: 0.18, Study time: 0.19, Medu: 0.21. The following are further correlations (0.15 or greater in magnitude)
among the independent variables themselves: Travel time & Address: -0.34, Study time & Age: 0.24, Study time &
Walc: -0.23.
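The screening described above, in which only correlations of at least 0.15 in magnitude are reported, can be sketched as follows. The column values are hypothetical and far fewer than in the real dataset, so the resulting coefficients will not match the figures quoted in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strong_pairs(columns, threshold=0.15):
    """List every pair of variables whose correlation is at least `threshold` in magnitude."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs


# Hypothetical values for eight students (illustration only).
columns = {
    "Medu": [4, 1, 3, 2, 4, 0, 3, 1],
    "Fedu": [4, 1, 2, 2, 3, 1, 3, 2],
    "traveltime": [1, 3, 1, 2, 1, 4, 2, 3],
}

for a, b, r in strong_pairs(columns):
    print(f"{a} & {b}: {r}")
```

The sign of each coefficient gives the polarity of the relationship, so a negative value for travel time against mother's education would indicate an inverse relationship.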
Exploration through PCA
The correlations mentioned above make it evident that multicollinearity exists within our dataset.
Considering this, the accuracy and reliability of our classification are expected to suffer if we include those
highly correlated variables. In order to understand the dimensionality and address this concern we performed a
Principal Component Analysis (PCA).
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. This transformation is defined in such a way that
the first principal component has the largest possible variance (that is, accounts for as much of the
variability in the data as possible), and each succeeding component in turn has the highest variance
possible under the constraint that it is orthogonal to the preceding components. The resulting vectors
are an uncorrelated orthogonal basis set.
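For two standardized variables the transformation can be written out by hand, which makes the definition above concrete. The correlation matrix [[1, r], [r, 1]] has eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2) with eigenvalues 1 + r and 1 - r. The sketch below uses hypothetical values for two correlated variables; it illustrates the technique and is not the PCA run on the full dataset.

```python
from math import sqrt


def standardize(values):
    """Centre to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def mean_product(x, y):
    """Average of element-wise products; a covariance when both inputs have mean 0."""
    return sum(a * b for a, b in zip(x, y)) / len(x)


# Hypothetical education levels for eight mothers and fathers.
medu = [4, 1, 3, 2, 4, 0, 3, 1]
fedu = [4, 1, 2, 2, 3, 1, 3, 2]

z1, z2 = standardize(medu), standardize(fedu)
r = mean_product(z1, z2)  # correlation between the two variables

# Project each observation onto the two orthogonal eigenvectors.
pc1 = [(a + b) / sqrt(2) for a, b in zip(z1, z2)]
pc2 = [(a - b) / sqrt(2) for a, b in zip(z1, z2)]

var_pc1 = mean_product(pc1, pc1)  # equals 1 + r, the largest possible variance when r > 0
var_pc2 = mean_product(pc2, pc2)  # equals 1 - r
cov_pcs = mean_product(pc1, pc2)  # zero: the component scores are uncorrelated

print(round(r, 3), round(var_pc1, 3), round(var_pc2, 3), round(cov_pcs, 3))
```

With a correlation of 0.64, as reported for the mother's and father's education, the first component would carry a variance of 1.64 out of a total of 2.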
In datamining, variance among the independent variables means those variables can ‘tell a great story’,
i.e. there is valuable information within those variables. A PCA outputs various metrics such as the SS
Loadings, which identify variance. By convention, SS Loadings greater than 1 identify the
significant PC’s. Our initial PCA identified 10 PC’s with SS Loadings greater than 1; however,
further analysis using a scree plot identified only five significant PC’s. Table 1: Variance of
PC’s shows the SS Loadings (all greater than 1) for the five PC’s that the scree plot identified.
The scree plot in Figure 4: Scree Plot was used to further analyze the variance among the PC’s. A
scree plot displays the eigenvalues associated with a component or factor in descending order versus
the number of the component or factor. Contrary to the SS Loadings of the 10 PC’s, the scree plot in
Figure 4: Scree Plot shows that 5 of the PC’s explain most of the variability. This is evident from the
line that starts to straighten after factor 5. The remaining factors explain a very small proportion of the
variability and are likely unimportant.
A parallel analysis was also done to further evaluate the variance within the PC’s. The parallel analysis
as shown in Figure 5: Parallel Analysis introduces a set of PC’s that was derived from a random
dataset. When the eigenvalues from the random data are larger than the eigenvalues from the original
PCA or factor analysis, we can consider the components or factors to be mostly random noise. Looking at
the parallel analysis plot, we can see that only three PC’s are clearly above the red line that identifies the optimal
coordinates.
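The two numerical retention rules discussed above can be compared in a short sketch. It treats the SS Loadings as the component eigenvalues, which is an assumption, and both lists of eigenvalues are hypothetical values chosen only so that the counts mirror those reported in the text; they are not the eigenvalues obtained in the study.

```python
def kaiser_rule(eigenvalues):
    """Number of components whose eigenvalue is greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def parallel_analysis(observed, random_reference):
    """Number of leading components whose eigenvalue exceeds the random-data eigenvalue."""
    kept = 0
    for obs, ref in zip(observed, random_reference):
        if obs <= ref:
            break  # from here on the components look like random noise
        kept += 1
    return kept


# Hypothetical eigenvalues in descending order (illustration only).
observed = [3.40, 2.10, 1.60, 1.25, 1.20, 1.10, 1.08, 1.05, 1.03, 1.01, 0.95, 0.90]
# Hypothetical eigenvalues from a random dataset of the same shape.
random_reference = [1.45, 1.38, 1.32, 1.27, 1.23, 1.19, 1.15, 1.11, 1.08, 1.04, 1.00, 0.97]

print("eigenvalue > 1 rule:", kaiser_rule(observed))
print("parallel analysis:", parallel_analysis(observed, random_reference))
```

The eigenvalue-greater-than-1 rule is the most permissive of the three approaches, which is why the scree plot and the parallel analysis were used to narrow the selection.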
The values in Table 2: Variance of Variables describe the amount of variance that is explained by
each variable with respect to the three PC’s.
PC1 is mostly defined by: Mother’s Education (0.8), Father’s Education (0.72), Mother’s Job (0.71),
Father’s Job (0.45), Travel Time (-0.41), Willingness to pursue higher education (0.43), Internet access
at home (0.40) and Grade 1 (0.31). Looking at the variables in PC1 we can see that they are more related
to the parents, socioeconomic factors and factors that affect First Generation Students.
PC2 is mostly defined by: Sex (-0.62), Study time (-0.49), Free time (0.52), Go out (0.55) and Walc (0.66).
Looking at the variables in PC2 we can see that they are more related to factors that are influenced by the
student as an individual.
PC3 is mostly defined by: Age (0.56) and Guardian (-0.61).
Further analysis of the data was done by plotting the PCA loadings of PC1 and PC2. Looking at the plot
of the PCA loadings in Figure 6: PCA Loadings Plot, we can see what look somewhat like clusters of
variables. The variables that appear to be grouped together are plotted as such because, as the
correlation matrix shows, they are highly correlated. Note that “higher” is the closest independent
variable to the dependent variable (G1). The plot shows that there is not much variance within the data;
therefore, there is a high level of collinearity among the variables.
A final analysis of the PC’s was done by plotting the PC1 scores against the PC2 scores (colored by pass/fail).
In the plot shown in Figure 7: PCA Scores Plot, each point on the graph represents a student in the
dataset. The color red indicates a student who failed and the color blue indicates a student who passed.
Looking at the graph we can see that most students who failed appear to be in the first quadrant of the
graph while most of the students who passed are in the fourth quadrant of the graph.
Datamining Models
The datamining models for this research were limited to models that cater for the classification of
binary/binomial type dependent variables; therefore, the following classification methods were applied
and evaluated:
o logistic regression with pc-scores (with and without oversampling)
o logistic regression with step-wise implementation
o classification Tree
o random-forest (classification tree)
Note: Because the ratio of pass to fail within the dataset was 75/25, oversampling was considered,
in which the training set was adjusted such that there was an equal number of passes and failures.
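One simple way to obtain an equal number of passes and failures is random oversampling with replacement, applied to the training set only. The sketch below assumes this variant; the text does not state which oversampling procedure was used, and the function name and example rows are hypothetical.

```python
import random


def oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority-class rows until every class is equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap up to the size of the largest class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set with a 75/25 ratio of pass to fail.
training = [(f"p{i}", "Pass") for i in range(75)] + [(f"f{i}", "Fail") for i in range(25)]
balanced = oversample(training, label_of=lambda row: row[1])

n_fail = sum(1 for _, label in balanced if label == "Fail")
n_pass = sum(1 for _, label in balanced if label == "Pass")
print(n_pass, n_fail)
```

The test set is left untouched so that the reported metrics reflect the original class ratio.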
Drawbacks of the Models to be used
General Drawbacks of Logistic Regression:
Many datasets contain variables that for various reasons don’t provide much information. A
logistic regression should include only the meaningful variables; however, it is important that all
meaningful variables are included. Logistic regression models should have little or no multicollinearity
because if the independent variables are related to one another, then the model will tend to overweight the
significance of those variables. Because the parameter estimation procedure of a logistic regression
relies heavily on having an adequate number of samples for each combination of independent variables,
small sample sizes can lead to wildly inaccurate estimates of parameters.
General Drawbacks of Decision Trees:
Some of the drawbacks of decision trees are related to the problem of multicollinearity. When two variables both explain the
same thing, a decision tree will greedily choose the best one, whereas many other methods will use
them both. Ensemble methods such as random forests can negate this to a certain extent, but you lose
the ease of understanding (Breiman et al., 1984).
General Drawbacks of Random Forest:
The main limitation of the Random Forest algorithm is that a large number of trees may make the
algorithm slow for real-time prediction. Very highly correlated independent variables, however, have only a
slightly negative effect on the model (Breiman, 2001).
Analysis of Results
Misclassification Costs
When evaluating our models we need to consider the costs that are introduced with the
misclassification of each class. In our study, misclassifying the actual “fail students” as
“pass students” is much costlier than misclassifying the actual “pass students” as “fail students”. Cost-
sensitive measures usually assume that the costs of making an error are known. That is, one has a cost
matrix, which defines the costs incurred in false positives and false negatives. Each example, x, can be
associated with a cost C(i, j, x), which defines the cost of predicting class i for x when the "true" class
is j. The goal is to make the decision that minimizes the expected cost (Batista, Prati, and Monard, 2004). If a
student is classified as “pass” but actually fails, he/she would incur the cost of paying for an
extra year of tuition fees and additional lunch fees. Conversely, if the student is classified as
“fail” but actually passes, the school would incur the unnecessary cost of providing remedial
classes and mentoring. Let’s say the tuition fees are $700 for the year, and the cost of lunch adds up to
$1000 for the year. The total cost of misclassifying the student as “pass” would be $1700. Let’s say
that the cost for the school to provide remedial classes and mentoring is $300 per student.
The total misclassification cost was calculated as:
Total cost = (cost of misclassifying as “pass” * number of students misclassified as “pass”) + (cost of
misclassifying as “fail” * number of students misclassified as “fail”)
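The total cost formula can be applied as in the following sketch, which uses the unit costs assumed in the text ($1700 and $300). The misclassification counts for the two models are hypothetical and serve only to show how models would be compared.

```python
COST_MISCLASSIFIED_AS_PASS = 700 + 1000  # extra year of tuition plus lunch, borne by the student
COST_MISCLASSIFIED_AS_FAIL = 300         # unnecessary remedial classes and mentoring


def total_cost(n_misclassified_as_pass, n_misclassified_as_fail):
    """Total misclassification cost for one model, following the formula above."""
    return (COST_MISCLASSIFIED_AS_PASS * n_misclassified_as_pass
            + COST_MISCLASSIFIED_AS_FAIL * n_misclassified_as_fail)


# Hypothetical error counts for two models (illustration only).
model_a = total_cost(n_misclassified_as_pass=3, n_misclassified_as_fail=40)
model_b = total_cost(n_misclassified_as_pass=10, n_misclassified_as_fail=20)
print(model_a, model_b)
```

Although model A makes more errors overall in this example (43 against 30), its total cost is lower because it misses fewer failing students.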
Results from Logistic Regression with PCA Scores Using R
Reflecting on the analysis of the data, multicollinearity was anticipated to be a major issue. In an
attempt to alleviate multicollinearity issues, the logistic regression using PCA scores was implemented.
Performing a PCA addressed the issue of multicollinearity because it produced PC’s that are
orthogonal to each other. The PCA above shows that the first 5 PC’s can be considered
significant; therefore, the scores of those five PC’s were used as the independent variables in this
logistic regression. A dataset is imbalanced if the classification categories are not approximately
equally represented. For such problems, the interest usually leans towards correct classification of the
“rare” class. Considering that within our dataset 71% of the outcome variable was a passing grade, we
did an oversampling of the training data to check if there were any issues that may be caused by
imbalanced data.
The overall results on the test dataset were somewhat better for the model trained without oversampling
than for the model trained with oversampling. The results are shown in Table 3: G
Logistic Regression, PCA Test Without Oversampling Accuracy Metrics and Table 4: H Logistic
Regression, PCA Test Without Oversampling Confusion Matrix with Misclassification Cost.
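The idea of regressing the pass/fail outcome on PC scores can be illustrated with a small logistic regression fitted by gradient descent. This is a sketch under stated assumptions and not the R model used in the study: it uses two hypothetical PC scores per student instead of five, codes the outcome as 1 for "Fail", and the learning rate and number of epochs are arbitrary choices.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_prob(w, b, x):
    """Estimated probability of failing for one student's PC scores."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_prob(w, b, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


# Hypothetical (PC1, PC2) scores; y = 1 means the student failed.
X = [(-1.8, 0.9), (-1.2, 1.1), (-0.9, 0.2), (-0.3, 0.8), (0.4, 0.9), (0.3, -0.2),
     (0.1, -0.4), (0.6, 0.3), (1.0, -0.9), (1.4, -0.2), (1.9, -1.1), (-0.4, 0.1)]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

w, b = fit_logistic(X, y)
print([round(v, 2) for v in w], round(b, 2))
print(round(predict_prob(w, b, (-1.5, 1.0)), 2), round(predict_prob(w, b, (1.5, -1.0)), 2))
```

Because the PC scores are uncorrelated by construction, the coefficients are not inflated by multicollinearity. The fitted probabilities would then be compared to the chosen cutoff.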
Results from Stepwise Logistic Regression Using R
As stated above, choosing only meaningful independent variables is also essential for a logistic
regression, hence the stepwise approach. The stepwise logistic
regression approach builds its model by adding variables one by one (forward selection) or by
removing variables one by one (backward elimination) in an attempt to identify and keep only those
variables that produce the best results. This method, however, does not address the issue of
multicollinearity. The stepwise approach has a serious problem with multicollinearity because it
tends to choose the predictors that best match the data sample. It will favor variables that have high-
magnitude coefficients in the sample, but not necessarily in the underlying population. Suppose there are two
correlated predictors that contribute equally to fitting the data sample. There is a chance that both will be
left out of the model if they are individually "less significant" than other non-correlated predictors (which
happened to have high coefficient magnitudes for the sample) and are therefore dropped by the stepwise
selection.
This approach identified the variables in Table 5: Logistic Regression Stepwise Coefficients as the
predictors that best match the data sample. Those are some of the variables that have a high correlation
with the dependent variable; however, the correlations mentioned above show that those variables have
a significant correlation among themselves as well as with other independent variables. Those
correlations show that other variables such as father’s education and mother’s job also had a high
correlation with the dependent variable; however, the stepwise approach didn’t include them in the
model. The variables chosen as the best predictors for the data sample were Family Size, Age, Internet,
Family Support, Paid, Mother’s Education, Study time, and Higher. The collinearity among those
variables caused the model to fail to identify the significance of some of them. Looking at Table 5:
Logistic Regression Stepwise Coefficients we can see that although mother’s education is a highly
significant variable it is not identified as such. The reason is probably that mother’s education
fits the model about as well as other independent variables such as ‘paid’ and ‘higher’. We can also
assume this based on domain knowledge, which is supported by research on the ‘effects of First
Generation Students’. Furthermore, the correlations also show that there is a high correlation among
those variables themselves as well as with others that are not included in the model.
Results from Decision Tree Using Rapid Miner
As with the stepwise logistic regression, multicollinearity causes the decision tree model
to fail to identify some variables as significant. Although mother’s education and father’s education are
highly correlated with the student’s grade they are not identified as significant variables in the tree.
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which
provides the classification of the instance. Each node in the tree specifies a test of some attribute of the
instance, and each branch descending from that node corresponds to one of the possible values for this
attribute (Breiman et al., 1984). As mentioned in the drawbacks of a decision tree, when two variables
both explain the same thing, the decision tree will greedily choose the best one. The correlation matrix
shows that these variables were correlated with other independent variables; therefore, the model chose
the variables that it deemed the better ones. As a consequence, the model was weakened.
This model outputs a high sensitivity, but a very low specificity. The very low specificity indicates that
there was a high misclassification of students who passed. This model produced a Sensitivity of 64%,
Specificity of 35%, Accuracy of 44%, Misclassification of 56% and AUC of 28%. This contributed
heavily to a very high total misclassification cost, which proved the model to be unreliable.
Results from Random Forest Using Dataiku
As mentioned in the drawbacks of the model, the main limitation of the Random Forests algorithm is
that a large number of trees may make the algorithm slow for real-time prediction (Breiman, 2001).
Considering the small size of our dataset that wasn’t a concern. This model produced a Sensitivity of
88%, Specificity of 52% and Total Misclassification Cost of 333. The AUC is 72%. Below is the
confusion matrix that was calculated. Note that the software chose 0.2 as its optimal cutoff. The model
identified significant variables that are consistent with those identified by the correlation
matrix and PCA output. Because the Random Forest algorithm is not heavily affected by
multicollinearity, this model produced fairly good results.
ROC Curves (all models)
The Receiver Operating Characteristic (ROC) curve shows the true positive rate versus the false
positive rate resulting from different cutoffs in the predictive model. The "faster" the curve climbs,
the better the model; conversely, a curve close to the diagonal line indicates a worse model.
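As an illustration of how each cutoff yields one point on the ROC curve, the following minimal sketch
computes the false positive rate and true positive rate for a few cutoffs. The scores and labels are
made-up values, not data from this study, and the function names `roc_points` and `auc_trapezoid` are
hypothetical helpers introduced only for this example.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, false positive rate, true positive rate) for each cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        # An instance is predicted positive when its score reaches the cutoff.
        predicted = [1 if s >= cutoff else 0 for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
        points.append((cutoff, fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Approximate the area under the curve through the given points (trapezoidal rule)."""
    curve = sorted((fpr, tpr) for _, fpr, tpr in points)
    curve = [(0.0, 0.0)] + curve + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities of failing and true outcomes (1 = fail).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels, cutoffs=[0.5, 0.2])
auc = auc_trapezoid(points)
print(points, auc)
```

Lowering the cutoff moves the point up and to the right: more failing students are caught, at the price of
more passing students being flagged. A curve lying on the diagonal corresponds to an area of 0.5.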
ROC curves for backward stepwise regression and PC scores regression (plotted on the same graph)
are shown in Figure 8 (ROC, Logistic Regression PCA and Stepwise).
The red curve represents the ROC curve for the stepwise regression and the blue curve represents the
ROC curve for the PC scores regression. Judging by the steepness of the curves (where the line looks
vertical), the optimal cutoff is 0.2.
Chosen Model
The logistic regression using the PC scores (no oversampling) produced the best model. At a cutoff of
0.2, this approach yielded a sensitivity (correctly classified fail) of 93.48%, specificity (correctly
classified pass) of 60.36%, accuracy (overall correct classification) of 70%, misclassification
(overall misclassification) of 30%, and the lowest total misclassification cost, 183.
Although some of the other approaches produced models with a sensitivity greater than 93.48%, it is
very important to also consider the specificity and the misclassification cost. For this reason, this
model was judged to be the best.
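The reported metrics can be recomputed from the confusion matrix in Table 4. The sketch below is
illustrative only. The function name `confusion_metrics` is hypothetical, and the unit costs (3 per
false positive, 17 per false negative) are an inference rather than values stated in this section: they
are the pair that reproduces both reported totals (183 at cutoff 0.2, and 251 at cutoff 0.3 when the
counts are back-calculated from the rates in Table 3).

```python
def confusion_metrics(tn, fp, fn, tp, cost_fp, cost_fn):
    """Compute standard classification metrics from confusion-matrix counts.

    Positive = student fails, negative = student passes.
    """
    total = tn + fp + fn + tp
    return {
        "sensitivity": tp / (tp + fn),      # correctly classified fail
        "specificity": tn / (tn + fp),      # correctly classified pass
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "total_cost": fp * cost_fp + fn * cost_fn,
    }


# Counts from Table 4 (cutoff 0.2, test set, no oversampling).
# cost_fp and cost_fn are ASSUMED values inferred from the reported totals.
metrics = confusion_metrics(tn=67, fp=44, fn=3, tp=43, cost_fp=3, cost_fn=17)

for name, value in metrics.items():
    if name == "total_cost":
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:.2%}")
```

Because a missed failing student is assumed to cost far more than a wrongly flagged passing student, a low
cutoff that raises sensitivity can lower the total cost even though overall accuracy drops.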
Implementation (How to Deploy and Intervene)
A computerized database is needed for easy, cheap, effective, and efficient long-term use
of the model. Software can be designed to extract relevant data from the database and execute the
model. The model can be adopted and implemented at either the institutional level or the ministry
level. Schools can individually choose to collect the necessary data, run the model, and make the
necessary interventions to assist the target students. Alternatively, the ministry of education can
use this tool to identify and gauge the level of additional support that it will provide to a particular
school. The resources that students need to improve their academic performance are limited;
therefore, implementing this model will benefit resource management.
The greatest challenge for the implementation of this tool in Belize is the unavailability of the
digitized data that is needed. Most schools in Belize are still using manual databases; furthermore, they
do not collect much data on student demographics. There is also the concern of data privacy, which
may make students reluctant to provide the required information and, in some cases, lead them to
provide inaccurate information. Business intelligence/datamining requires sufficient and accurate
data; therefore, this needs to be taken into consideration.
Challenges
The greatest challenge when designing this model was the availability of data that relates to student
demographics and academic performance. It was not feasible to manually collect the desired data
because of the limited time (3 months) I had to do the research. Data privacy would also have been an
issue had I tried to collect the data manually. Among other things, these issues were some of the
reasons why I could not find a larger dataset. As outlined in the drawbacks of the models, the size of
the dataset can affect their performance; hence, there is room for improving the models.
Future Work
There are multiple levels for many of the independent variables used in the model. For example,
‘mother’s job’ has 5 levels: 0 = at home, 1 = other, 2 = service department, 3 = health, 4 = teacher.
Future work on this research should modify the model such that it can identify the varying effects at
different levels of the individual variables. This will allow for a more detailed classification and more
customized application of intervention on vulnerable students.
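One possible way to expose level-specific effects is to dummy-code each multi-level variable so that
every level except a baseline receives its own indicator. The following is a minimal sketch of that
idea, not the method used in this study; the helper `dummy_code`, the short level names, and the choice
of "at home" as the baseline are assumptions made for illustration.

```python
# Level codes for 'mother's job' as listed in the text; the short names are illustrative.
MJOB_LEVELS = {0: "at_home", 1: "other", 2: "service", 3: "health", 4: "teacher"}


def dummy_code(value, levels, baseline, prefix):
    """Return one 0/1 indicator per non-baseline level of a categorical variable."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {
        f"{prefix}_{name}": int(value == code)
        for code, name in levels.items()
        if code != baseline
    }


# A student whose mother is a teacher (code 4), with 'at home' (code 0) as baseline.
teacher_row = dummy_code(4, MJOB_LEVELS, baseline=0, prefix="Mjob")
# A student in the baseline level has all indicators equal to zero.
baseline_row = dummy_code(0, MJOB_LEVELS, baseline=0, prefix="Mjob")

print(teacher_row)
print(baseline_row)
```

Each indicator could then enter a model as its own independent variable, so that its coefficient is
interpreted relative to the baseline level.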
Conclusion
Although it is not feasible to change the factors that relate to the parents, policy makers or the school
can fill the gap by providing alternative forms of teaching methodology and assessment to the
students identified as vulnerable to failing. Remedial classes can also be offered. Such measures can
increase study time, compensate for limited family support, offer counselling that strengthens
students’ desire to pursue higher education, and provide resources such as internet access that might
not be available at home.
This research aimed to analyze various student demographics in an effort to classify
students as those who will fail or pass; however, no historical data (e.g., previous grades, previous
failures, previous absences) were used. This proactive approach, which does not rely on data about the
students’ previous performance, is what differentiates this research from most of the other related
studies. This model can be easily adopted by any institution regardless of the availability of data on
the students’ previous performance.
Some models produced a very high true classification rate for the students who will fail and a
very low true classification rate for students who will pass; however, this could mean that many of
the students classified as failing are students who will actually pass. Although this type of
misclassification is less costly than the reverse, it still had a great effect on the total
misclassification cost. Correctly identifying and improving the educational achievement of highly
vulnerable students would be evidence of a successful policy design and implementation derived from
the solution of this project.
References
Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C., “A study of the behavior of several methods for
balancing machine learning training data”, SIGKDD Explorations, 6(1), 2004.
Breiman L., “Random Forests”, Machine Learning, 45(1), pp. 5–32, 2001.
Breiman L., J. Friedman, R. Olshen, and C. Stone, “Classification and Regression Trees”, 1984.
Cushman K., “First in the family: Your college years. Advice about college from first generation
students”, Next Generation Press, 2006.
Minaei-Bidgoli, “Using datamining to predict secondary school student performance”, 2003.
Paulo C. and D. Silva, “Using Datamining to Predict Secondary School Student Performance”,
2008.
Pekka K. and K. Nissinen, “Background factors behind mathematics achievement in Finnish education
context: Explanatory models based on TIMSS 1999 and TIMSS 2011 data”, 2011.
REUTERS/Keith Bedford, “The challenges of a first generation college student are tough, but not
impossible”, 2015.
Romero C., S. Ventura, P. Espejo, and C. Hervás, “Data mining algorithms to classify students”, 1st
International Conference on Educational Data Mining, June 2008.
Shmueli G., “Data Mining for Business Intelligence”, 2nd ed., Chpt. 5, p. 101, 2010.
Figure 1. Heat Map of Medu vs Higher
Figure 2. Heat Map of Medu vs G1
Figure 3. Heat Map of Higher vs G1
Figure 4. Scree Plot
Figure 5. Parallel Analysis
Figure 6. PCA Loadings Plot
Figure 7. PCA Scores Plot
Figure 8. ROC Logistic Regression PCA and Stepwise
Table 1. Variance of PCs
             PC1   PC2   PC3   PC4   PC5
SS loadings  3.09  2.07  1.54  1.4   1.31
Table 2. Variance of Variables
Variable     PC1    PC2    PC3
sex         -0.08  -0.62   0.18
binAge      -0.24   0.1    0.56
address      0.35  -0.01   0.23
famsize      0.01  -0.1   -0.15
Pstatus      0.03  -0.1    0.32
Medu         0.8    0.13   0.12
Fedu         0.72   0.13  -0.05
Mjob         0.71   0.24   0.07
Fjob         0.45   0.07  -0.22
reason       0.28  -0.17   0.18
guardian     0.16  -0.03  -0.61
traveltime  -0.41   0.08  -0.22
studytime    0.23  -0.49   0.16
famsupport   0.25  -0.17   0.07
paid         0.26  -0.03   0.2
activities   0.19   0.17  -0.03
nursery      0.19  -0.07   0.07
higher       0.43  -0.26  -0.25
internet     0.4    0.16   0.26
romantic    -0.09  -0.05   0.38
famrel       0.05   0.11  -0.07
freetime    -0.02   0.52   0.08
goout       -0.02   0.55   0.19
walc        -0.08   0.66  -0.01
health       0.02   0.25  -0.22
G1          -0.38   0.22   0.18
Table 3. Logistic Regression, PCA Test Without Oversampling Accuracy Metrics
Test (No Oversampling)
Cutoff  Sensitivity  Specificity  Accuracy  Misclassification  AUC   Total Cost
0.5     56.52%       92.79%       82%       18%                -36%
0.4     67.39%       86.49%       81%       19%                -19%
0.3     78.26%       75.67%       76%       24%                3%    251
0.2     93.48%       60.36%       70%       30%                33%   183
Table 4. Logistic Regression, PCA Test Without Oversampling Confusion Matrix with Misclassification Cost
Confusion matrix for cutoff of 0.2 (test - no oversampling)
                  Predicted
                  Negative  Positive  Total Cost
Actual  Negative  67        44        183
        Positive  3         43
Table 5. Logistic Regression Stepwise Coefficients
              Estimate  Std. Error   z value  Pr(>|z|)  Significance
(Intercept)  -0.65274  1.25717     -0.519   0.603613
binAge1       0.44293  0.18702      2.386   0.017867  *
binAge2      -0.03441  0.66696     -0.052   0.958851
famsize1      0.34256  0.18454      1.856   0.063411
Medu1         1.83682  1.23825      1.483   0.137968
Medu2         1.37898  1.2366       1.115   0.264791
Medu3         1.32442  1.23941      1.069   0.285258
Medu4         0.64559  1.24184      0.52    0.603159
studytime2   -0.53506  0.18463     -2.898   0.003755  **
studytime3   -1.55334  0.30479     -5.096   3.46E-07  ***
studytime4   -1.42009  0.42916     -3.309   0.000936  ***
famsup1       0.42009  0.17824      2.357   0.018431  *
paid1         0.71621  0.20131      3.558   0.000374  ***
higher1      -1.62078  0.293       -5.532   3.17E-08  ***
internet1    -0.43206  0.19769     -2.186   0.028845  *