Predicting and Analyzing Absenteeism at Workplace Using Machine Learning Algorithms

Amarildo Rista*, Jaumin Ajdari*, Xhemal Zenuni*

* SEE University/Faculty of Contemporary Sciences and Technologies, Tetovo, North Macedonia


Abstract – Absenteeism, the habitual or recurrent absence from work, continuously disrupts the smooth running of a business, affecting organizational performance and productivity and lowering employee morale. The Oil Refinery in Albania (ARMO), which employs 1200 people, is facing a high rate of absences. If the necessary measures are not taken seriously, absenteeism may jeopardize operation and production. Predicting absenteeism is complex because it is influenced by many factors. Data mining and machine learning (ML) algorithms are a good solution for predicting and analyzing it. The aim of this paper is to identify and evaluate appropriate ML algorithms to predict and analyze absenteeism at the workplace. The dataset consists of attributes such as age, education, employment category, day, month and length of service, and contains 125000 records. The algorithms are analyzed and compared in terms of accuracy, precision and sensitivity using the Weka tool.

Keywords - data mining, absenteeism, machine learning algorithms, prediction.

I. INTRODUCTION

The Oil Refinery in Albania (ARMO) has developed different ways and means of improving its employee resource management practices, particularly with the objective of reducing absenteeism and controlling employee turnover. Employee engagement, motivation and other measures such as employee welfare, job performance evaluation and continuous training have been implemented to reduce employee absenteeism and improve organizational performance. Despite the awareness of the adverse effect of absenteeism on organizational efficiency, the level of absenteeism at the workplace remains high. This paper focuses on the prediction and analysis of absenteeism at ARMO using data mining (DM) and machine learning (ML) algorithms. ARMO is one of the largest refineries in the Balkans and processes oil and its sub-products. It employs 1200 people and has recently been facing a high rate of absenteeism at work. Using DM and ML methods to analyze and predict absenteeism is a good solution. DM is defined as a process used to extract important and useful information from large sets of data [1], while ML is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed [2].

The first step in analyzing data through DM is designing the model [3]. A model is an algorithm or set of rules that connects a set of inputs to a specific output. There are two ways to fit data to a model design [4]: (1) predictive methods, which are used to extract models that describe important data classes or that predict future data trends, and which use the known values to predict the unknown values; and (2) descriptive methods, which recognize connections that are hidden in the data and provide different results. Based on the features of the dataset and the aim of this paper, we have chosen classification methods, which belong to the predictive methods, to analyze our data. The dataset contains several categories of factors that prove to be effective in detecting absenteeism at work, such as age, education, employment category, length of service, residence, civil status, distance from workplace, suffering from any disease, job satisfaction and leadership style.

This paper can make a significant contribution to the human resource department of an organization. The management can gain an understanding of employee activities and behavior, which can eventually help in crucial decision making on both monitoring employees at work and recruiting potential employees. The rest of the paper is structured as follows: Section 2 presents a brief discussion of the related work. Section 3 focuses on the data mining classification techniques. Section 4 presents the methodology. Section 5 presents the results and analysis. Some conclusions are listed in Section 6.

II. RELATED WORK

The application of DM and ML methods in human resource management is a new research field. Predicting and analyzing absenteeism at the workplace plays a crucial role in demonstrating the productive and profitable capacity of a company. Several classification techniques are used in [5] to predict the absenteeism of employees at the workplace. That paper analyzes and compares four ML algorithms, namely Decision Tree, Gradient Boosted Tree, Random Forest and Tree Ensemble. Based on the experiments, the Gradient Boosted Tree produced the best result, with an accuracy rate of 82% compared to the other algorithms. In [6], other classification algorithms are analyzed to find the probability of an employee being absent from work for one or more days in a future period of seven days. The authors analyzed the results of the Random Forest, Multilayer Perceptron, Support Vector Machine, Naive Bayes, XGBoost and Long Short-Term Memory algorithms, concluding that XGBoost reached the best result, with 72% accuracy.

Neural networks and deep learning algorithms play an important role in predicting the behavior of employees towards punctuality at the workplace. In [7], a neural-network-based model is presented to study the behavior and predict the absenteeism of employees at the workplace. This model is based on a Deep Neural Network, which is composed of multiple hidden layers instead of the single layer of a Shallow Neural Network. The dataset used contains 20 different features and 740 samples which reflect human behaviors. The model achieves an accuracy of 90.6%, compared with 73.3% for a single-layer neural network and 82% for Decision Tree, SVM and Random Forest. Another similar model based on a neural network is presented in [8]. This model is trained with a dataset that has 38 attributes and 2243 records. It reduces the number of attributes using Rough Sets and still obtains good results in this form. In [9], three neural network models (Backpropagation, Radial Basis Function and Long Short-Term Memory) are implemented to solve the absenteeism prediction problem. Based on the experimental results, it is concluded that the Long Short-Term Memory neural network has a prediction rate of 99.9%, higher than the others. Another ML approach is applied in [10], where linear regression and Support Vector Regression (SVR) are used to obtain a predictive absenteeism model. The linear regression is performed with all attributes, while for SVR two parameters are considered: age and season of the year. According to the results, both algorithms can be used to predict absenteeism at the workplace with good accuracy.

MIPRO 2020/CTI 527

III. DATA MINING CLASSIFICATION TECHNIQUES

Classification is the process of finding a model that predicts data classes. Classification techniques [11] are part of the predictive methods and are useful for analyzing large amounts of data. The main purpose of classification algorithms is to maximize the predictive accuracy obtained by the classification model. Several classification techniques are presented below.

A. Decision Tree

The Decision Tree [12] can be used for solving regression and classification problems. It creates a training model that can be used to predict the class or value of the target variable by learning decision rules inferred from prior data. The model consists of a root and nodes. The root represents the entire sample, which is further divided into two or more sets. The nodes fall into two categories: decision nodes (nodes that are split into further sub-nodes) and leaf nodes (nodes that do not split further). Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to one of the possible answers to the test. This process is recursive and is repeated for every subtree rooted at the new node. Fig. 1 represents the Decision Tree classifier.

Figure 1. Decision Tree Classifier

Some decision-tree-based algorithms are J48, Hoeffding Tree, Random Tree and REPTree. J48 classifies the class attribute based on the input attributes and is often referred to as a statistical classifier. J48 is based on the C4.5 algorithm [13]. C4.5 builds decision trees from a set of training data using the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is based on the difference in entropy: the attribute with the highest difference in entropy is chosen to make the decision.

The Hoeffding Tree [14] algorithm can learn from large data streams, assuming that the distribution generating the examples does not change over time. This algorithm works well with small samples, as it uses the Hoeffding bound, which quantifies the number of observations that are necessary to estimate statistical values within a prescribed precision.

Random Tree [15] is a supervised classifier. It employs a bagging idea to produce a random set of data for constructing a decision tree. In a standard tree, each node is split using the best split among all variables. In a random tree, each node is split using the best among a subset of predictors randomly chosen at that node. This algorithm can be used for both classification and regression problems. When used as a classifier, the random trees take the input feature vector, classify it with every tree in the forest, and output the class that received the most votes. In the case of regression, the response is the average of the responses over all the trees in the forest. Random Forests improve the performance of decision trees considerably. The approach creates a balanced tree in which one global setting for the ridge value works across all leaves, thus simplifying the optimization procedure.

The REPTree algorithm [16] uses regression tree logic to create multiple trees in different iterations. To build the trees it uses information gain as the splitting criterion, and it prunes them using reduced-error pruning. REPTree can be considered a fast decision tree learner and is based on C4.5.
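The entropy-based splitting criterion described above can be illustrated with a minimal Python sketch. It computes plain information gain (the difference in entropy before and after a split) and picks the attribute with the highest gain. The toy records, the attribute names and the helper names (`entropy`, `information_gain`, `best_attribute`) are hypothetical and are not taken from the ARMO dataset or from Weka; note also that C4.5 itself normalizes this quantity into a gain ratio, which is omitted here for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, attribute):
    """Difference between the entropy before and after splitting on one attribute."""
    total = len(labels)
    # Group the class labels by the value the attribute takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy records, for illustration only.
rows = [
    {"education": "secondary", "residence": "village"},
    {"education": "secondary", "residence": "city"},
    {"education": "university", "residence": "village"},
    {"education": "university", "residence": "city"},
]
labels = ["absent", "absent", "present", "present"]

if __name__ == "__main__":
    for name in ("education", "residence"):
        print(name, information_gain(rows, labels, name))
    print("split on:", best_attribute(rows, labels, ["education", "residence"]))
```

In this toy example "education" separates the two classes perfectly, so it is selected for the split, while "residence" leaves both subsets evenly mixed and yields no gain.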

B. Support Vector Machines (SVM)

SVM [17] is a linear model for regression, classification and pattern recognition. It can solve linear and non-linear problems and works well for various practical problems. The idea of SVM is simple: the algorithm creates a hyperplane that separates the data into classes. It takes the data as input and outputs a line (a hyperplane in higher dimensions) that separates the classes, if possible. The SVM makes its decision in such a way that the separation between the classes is as wide as possible. It is considered a good classifier because of its high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is very high.

C. Logistic Regression

Logistic regression is a generalization of linear regression [18]. It is a statistical model that uses a logistic function to model a binary dependent variable, or a multi-class dependent variable, although many more complex extensions exist. It is used to classify low-dimensional data that have nonlinear boundaries. It also provides the difference in the percentage of the dependent variable and ranks each individual variable according to its importance. The main aim of logistic regression is thus to determine the result of each variable correctly.
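As an illustration of how a logistic function turns a weighted sum of attributes into a binary class decision, the following minimal Python sketch may help. The weights, the bias and the 0.5 threshold are hypothetical values chosen for the example; in practice they would be fitted to the training data, which is not shown here.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Estimated probability of the positive class for one instance."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical, hand-picked parameters (not fitted to any real data).
weights = [0.8, -0.5]
bias = -0.2

if __name__ == "__main__":
    for instance in ([1.0, 0.0], [0.0, 2.0]):
        print(instance, predict_proba(instance, weights, bias), predict(instance, weights, bias))
```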

D. Naive Bayes and BayesNet Classifiers

Naive Bayes (NB) is considered a statistical classifier [19]. It can predict class membership based on probabilities. In statistics, everything revolves around hypotheses: first a hypothesis is made, and then evidence is collected to test that hypothesis. To calculate the probability of a hypothesis, the following formula is used:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

Where:

P(c|x) is the posterior probability of class c given the predictor x.

P(c) is the prior probability of the class.

P(x|c) is the likelihood, which is the probability of the predictor given the class.

P(x) is the prior probability of the predictor.

NB can be used for real-time prediction and is also known for its multi-class prediction capability: it can predict the probability of multiple classes of the target variable. The Bayes Net, in turn, is a probabilistic graphical model that represents a set of variables and their conditional dependencies through a directed acyclic graph [20]. The Bayes Net's graphical model is composed of nodes and links, where each node represents a random variable and each link describes the probabilistic dependency between two variables.
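The posterior probability used by Naive Bayes can be worked through on a small example. The sketch below estimates the prior, the likelihood and the evidence from frequency counts for a single categorical predictor and combines them with Bayes' rule. The records and the function name `posterior` are hypothetical; a full Naive Bayes classifier would multiply the likelihoods of several predictors under the independence assumption, which is not shown.

```python
from collections import Counter


def posterior(records, x_value):
    """Return P(c | x) for every class c, given (predictor, class) records."""
    total = len(records)
    class_counts = Counter(c for _, c in records)
    joint_counts = Counter(records)
    # Evidence P(x): fraction of records in which the predictor takes this value.
    p_x = sum(1 for x, _ in records if x == x_value) / total
    if p_x == 0:
        raise ValueError("predictor value never observed")
    result = {}
    for c, n_c in class_counts.items():
        prior = n_c / total                           # P(c)
        likelihood = joint_counts[(x_value, c)] / n_c  # P(x | c)
        result[c] = likelihood * prior / p_x           # Bayes' rule
    return result


# Hypothetical (residence, outcome) records, for illustration only.
records = [
    ("village", "absent"), ("village", "absent"), ("village", "present"),
    ("city", "present"), ("city", "present"), ("city", "absent"),
    ("city", "present"), ("city", "present"),
]

if __name__ == "__main__":
    print(posterior(records, "village"))
```

With these toy counts, two of the three "village" records are "absent", so the posterior of "absent" given "village" is 2/3 and that of "present" is 1/3.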

IV. METHODOLOGY

This section defines the analytical methods to achieve the set objectives by providing clarification and rationalization of the research design, sampling, data collection and analysis.

A. Research design

This research comprises a quantitative data collection in which numerical and standardized data were collected to identify relationships and trends. The dataset that is analyzed consists of 125000 instances and 10 attributes: age, education, employment category, length of service, residence, civil status, distance from workplace, suffering from any disease, job satisfaction and leadership style. Various ML algorithms are analyzed on this dataset. The experiments were performed in the Weka tool.

B. Target Population and sampling

The study targets staff at three operational levels, where the different job categories were selected in balanced proportions. The target population was organized into three groups, namely Operator, Supervisor, and Technical and Engineer. The categories and the number of participants (N) in each group are indicated below. Table 1 presents the characteristics of the target population.

TABLE 1. CHARACTERISTICS OF TARGET POPULATION

Group    Category    N      Job group
1        Level 1     166    Technical and Engineer
2        Level 2     167    Supervisor
3        Level 3     167    Operator
Total                500

A stratified random sampling technique was chosen to increase the representativeness of the sample. Respondents were randomly selected from each department/section, and each respondent had an equal chance of being selected.
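Stratified random sampling as described above can be sketched in a few lines of Python: the population is grouped by stratum and a fixed number of respondents is drawn at random from each group. The sample sizes per level follow Table 1; the synthetic population, the even split of 1200 employees across the three levels, the field names and the function name `stratified_sample` are assumptions made only for this illustration.

```python
import random
from collections import Counter


def stratified_sample(population, strata_key, sizes, seed=0):
    """Draw sizes[stratum] members uniformly at random from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(person[strata_key], []).append(person)
    sample = []
    for stratum, n in sizes.items():
        # rng.sample draws without replacement, so nobody is selected twice.
        sample.extend(rng.sample(strata[stratum], n))
    return sample


# Synthetic population of 1200 employees; the even split across levels is assumed.
population = [{"id": i, "level": f"Level {i % 3 + 1}"} for i in range(1200)]
sizes = {"Level 1": 166, "Level 2": 167, "Level 3": 167}  # per Table 1
sample = stratified_sample(population, "level", sizes)

if __name__ == "__main__":
    print(len(sample), Counter(person["level"] for person in sample))
```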


C. Data Collection Methods

Both secondary and primary data were considered useful for this research. Secondary data were used for a background understanding of the types of and reasons for absenteeism, whereas primary data were collected through a survey questionnaire in order to analyze the attributes that can affect absenteeism.

I. Secondary Data

As secondary data we have classified the following documentation: the unauthorised absences report, the medical survey report, the absenteeism follow-up record and the absence report. Table 2 summarizes the collection of secondary data.

TABLE 2. SECONDARY DATA COLLECTED

II. Primary data

A survey questionnaire was designed to collect first-hand data in order to define the attributes of the dataset and meet the objectives of this paper. The questionnaire was divided into two parts: the first part assessed demographic data such as age and educational background, together with basic data on working conditions, and the second part addressed the level of job satisfaction and the leadership style.

V. RESULT AND ANALYSIS

This section presents the experimental results. The experiments are performed in the Weka tool using the 10-fold cross-validation procedure [21], which means that the dataset is separated into 10 subsets known as folds. Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subset of the data. The DM algorithm uses 19 subsets as the training set and 1 subset as the test set to measure the performance of the algorithm. This process is repeated 20 times and the average of the results is obtained. Depending on the size of our dataset, k=20 is selected. The dataset is composed of 125000 instances that cover the working hours of 500 employees during a year.
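The cross-validation procedure described above can be sketched generically for any number of folds k: the instances are shuffled, split into k folds, and each fold in turn serves as the test set while the remaining k-1 folds form the training set; the k scores are then averaged. This is a plain, unstratified illustration and does not reproduce Weka's internal implementation; the function names and the `evaluate` callback are hypothetical.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i receives every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_instances, k, evaluate, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_instances, k, seed)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Dummy evaluation that only reports the share of instances held out per fold.
    print(cross_validate(100, 10, lambda train, test: len(test) / 100))
```

In a real experiment, `evaluate` would train a classifier on the training indices and return its accuracy (or another metric) on the test indices.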

A. Assessment Criteria

The evaluation and comparison of the algorithms are based on accuracy, recall and precision, as in [22].

I. Precision – the number of correct positive results divided by the number of positive results predicted by the classifier. Mathematically, it can be expressed as:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

II. Recall – the number of correct positive results divided by the number of all relevant samples that should have been identified as positive. Mathematically, it can be expressed as:

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

III. F-Measure – the harmonic mean of precision and recall. It indicates how precise the classifier is (how many instances it classifies correctly) as well as how robust it is. It tries to find the balance between precision and recall. The greater the F-measure, the better the performance of the model. Mathematically, it can be expressed as:

$$F\text{-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$

Where:

True positive (TP): correct positive prediction

False positive (FP): incorrect positive prediction

True negative (TN): correct negative prediction

False negative (FN): incorrect negative prediction

Data preprocessing is the first step in the data mining process. It includes the preparation, cleaning, normalization and transformation of data [23]. Fig. 2 shows the visualization results from the preprocessing phase and specifically represents the distribution of absenteeism according to the "education" attribute.
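The precision, recall and F-measure criteria defined in this subsection can be computed directly from the TP, FP, TN and FN counts. The following Python sketch shows one way to do so; the label lists, the choice of "absent" as the positive class and the function names are hypothetical and serve only as an illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN for one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical actual and predicted labels, for illustration only.
actual = ["absent", "absent", "present", "present", "absent", "present"]
predicted = ["absent", "present", "present", "absent", "absent", "present"]

if __name__ == "__main__":
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive="absent")
    print("TP, FP, TN, FN:", tp, fp, tn, fn)
    print("precision, recall, F-measure:", precision_recall_f(tp, fp, fn))
    print("accuracy:", (tp + tn) / len(actual))
```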

Figure 2. Visualization of the dataset in the preprocessing phase.


During this phase, the distribution of absenteeism according to the dataset attributes is examined. After the preprocessing phase, the model is built and the results for each analyzed algorithm are reported. The algorithms we have analyzed are BayesNet, Naïve Bayes, Logistic, J48, Random Tree, REPTree and Hoeffding Tree. Table 3 presents the results of these algorithms.

TABLE 3. RESULTS OF ALGORITHMS

As can be seen from the results in Table 3, all analyzed algorithms give acceptable results. The Bayes Net, J48, Random Tree and REPTree algorithms stand out: they have very high accuracy and correctly classify more than 99% of instances. Compared to the other algorithms, the Bayes Net algorithm presents its results in a simpler way. Fig. 3 shows the visualization graph of the Bayes Net algorithm. This algorithm has classified the absenteeism that occurred based on all dataset attributes: age, level of education, day, month, length of service, civil status, distance from workplace, job satisfaction, employment category and leadership style.

Figure 3. Visualization graph of the Bayes Net algorithm

After building the model and evaluating the data, the results of the Bayes Net algorithm lead to the following conclusions. Employees who are between 20-30 and 60-70 years old tend to have more absences from work. Regarding the level of education, employees with secondary school education have a higher number of absences. Also, employees who live in a village, are married, live 20 km from the workplace, work in shifts or suffer from a disease tend to be absent from work more often. Regarding the length of service, employees with over 40 years of experience tend to be absent more often. Regarding leadership style and job satisfaction, employees who responded "neutral" tend to be absent more often. Regarding the type of absence, employees who have secondary education, are over 50 years old, live in a city, work in shifts and belong to the Level 1 employment category tend to have more unauthorized absences.

VI. CONCLUSION

Absenteeism at the workplace continuously disrupts the smooth running of a business, affecting organizational performance and productivity and lowering employee morale. The focus of this research was to identify and evaluate appropriate ML algorithms to predict and analyze absenteeism at the workplace. The analysis is based on data obtained from the ARMO company operating in Albania. The attributes that have been studied are age, education, employment category, length of service, residence, civil status, distance from workplace, diseases, job satisfaction and leadership style. By analyzing and comparing various ML classification algorithms in terms of precision, sensitivity and accuracy, it is concluded that the Bayes Net, J48, Random Tree and REPTree algorithms have very high accuracy and correctly classify more than 99% of instances. Compared to the other algorithms, the Bayes Net algorithm presents its results in a simpler way. The use of DM and ML techniques is a good solution for predicting and analyzing absenteeism at the workplace. These techniques make a significant contribution to the human resource department of an organization. The management can gain an understanding of employee activities and behavior, which can eventually help in crucial decision making on both monitoring employees at work and recruiting potential employees. This research could serve as a relevant instrument for further investigations at ARMO or other companies.

REFERENCES

[1] David L. Olson, Dursun Delen, "Advanced Data Mining Techniques", Springer, 2015.

[2] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, Bin Yu, "Interpretable machine learning: definitions, methods, and applications", PNAS, 2019.

[3] V. Bhatnagar, S. K. Gupta, "Modeling the KDD Process", Encyclopedia of Data Warehousing and Mining, Second Edition, Information Science Reference, Hershey, New York, pp. 1337-1344, 2008.

[4] Usama M. Fayyad, et al., "Advances in Knowledge Discovery and Data Mining", Cambridge, Mass.: MIT Press, 1996.

[5] Zaman Wahid, A. K. M. Zaidi Satter, Abdullah Al Imran, Touhid Bhuiyan, "Predicting Absenteeism at Work Using Tree-Based Learners", Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, ACM, 2019.

[6] Evandro Lopes de Oliveira, José M. Torre, "Absenteeism Prediction in Call Center Using Machine Learning Algorithms", Springer, 2019.

[7] Syed Atif Ali Shah, Irfan Uddin, Furqan Aziz, Shafiq Ahmad, Mahmoud Ahmad Al-Khasawneh, Mohamed Sharaf, "An Enhanced Deep Neural Network for Predicting Workplace Absenteeism", Hindawi Complexity, Volume 2020, Article ID 5843932, 12 pages.

[8] Ricardo Pinto Ferreira, Andréa Martiniano, Domingos Napolitano, Edquel Bueno Prado Farias, Renato José Sassi, "Artificial Neural Network and Their Application in the Prediction of Absenteeism at Work", International Journal of Recent Scientific Research, Vol. 9, Issue 1(G), pp. 23332-23334, January 2018.


[9] Kagan Dogruyol, Boran Sekeroglu, "Absenteeism Prediction: A Comparative Study Using Machine Learning Models", 10th International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions, 2020.

[10] Krittika Tewari, Shriya Vandita, Shruti Jai, "Predictive Analysis of Absenteeism in MNCS Using Machine Learning Algorithm", Springer, 2020.

[11] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009, Hong Kong.

[12] Himani Sharma, Sunil Kumar, "A Survey on Decision Tree Algorithms of Classification in Data Mining", International Journal of Science and Research (IJSR).

[13] J. R. Quinlan, "C4.5: Programs for Machine Learning", Elsevier, 2014.

[14] A. K. Hamoud, "Selection of Best Decision Tree Algorithm for Prediction and Classification of Students Action", American International Journal of Research in Science, Technology, Engineering and Mathematics, 16(1), pp. 26-32, 2016.

[15] Ian H. Witten, Eibe Frank, Mark A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques", Third Edition, Morgan Kaufmann Publishers (an imprint of Elsevier).

[16] Ian H. Witten, Eibe Frank, Mark A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques", Third Edition, Morgan Kaufmann Publishers (an imprint of Elsevier).

[17] Janmenjoy Nayak, Bighnaraj Naik, H. S. Behera, "A Comprehensive Survey on Support Vector Machine in Data Mining Tasks: Applications & Challenges", International Journal of Database Theory and Application, Vol. 8, No. 1, pp. 169-186, 2015.

[18] Charu C. Aggarwal, "Data Classification: Algorithms and Applications", ACM, July 2014.

[19] Raj Kumar, Rajesh Verma, "Classification Algorithms for Data Mining: A Survey", International Journal of Innovations in Engineering and Technology (IJIET).

[20] Remco R. Bouckaert, "Bayesian Network Classifiers in Weka", University of Waikato, 2004.

[21] Tzu-Tsung Wong, "Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation", Elsevier, 2015.

[22] https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234, accessed 27/04/2020.

[23] Salvador García, Julian Luengo, Francisco Herrera, "Data Preprocessing in Data Mining", Springer, 2015.
