Cis520 group e

HUMAN RESOURCES BIG DATA ANALYTICS. Submitted by: Enrique Romero, Monalisa Chaure, Rouben Oshian. Submitted to: Dr. Jongwook Woo.

Transcript of Cis520 group e

Page 1: Cis520 group e

HUMAN RESOURCES BIG DATA ANALYTICS

Submitted by: Enrique Romero, Monalisa Chaure, Rouben Oshian

Submitted to: Dr. Jongwook Woo

Page 2: Cis520 group e

1. Background

Over the course of their day-to-day operations, Human Resources teams are accumulating increasingly large pools of data composed of employee profiles, activity records and performance appraisals. Leveraging this data for insights has become a challenge, which creates both a demand and an opportunity for big data analysis. Our goal is to address these pain points of today’s employee management by turning HR data into descriptive and inferential statistics. The resulting analytics can give HR professionals the information they need to make effective decisions regarding their workforce.

2. Overview

To prove our concept, we will acquire the raw data and distribute it across a Hadoop cluster. We will then query the data with basic filtering functions to demonstrate the system's flexibility and customization. In addition, we will apply analytical tools to the refined data, representing a set of business intelligence tools in line with today’s human resources management needs.

3. System Requirements

Storage: Standard cloud-based object storage node with a business continuity system in place.
File System: Hadoop Distributed File System (HDFS)
MapReduce Platform: Apache Hive
Analytics and Visualization Tools: Microsoft Excel enabled with an ODBC driver, and Microsoft Power BI.

Page 3: Cis520 group e

4. Dataset

The dataset consists of 17 comma-delimited text files, in which commas separate the columns and each row starts a new record. Each record has 20 data elements in numeric and text format. The dataset comprises employment records from various government agencies of the United States.

Page 4: Cis520 group e

5. Storage Deployment

The storage is deployed from the Azure portal and configured as a classic Azure storage account located in the Central United States region. It is geo-redundant, keeping local and geo-distributed copies with a replication factor of three for disaster recovery. It can accommodate block and page blobs, tables and queues, with a maximum of 500 IOPS per disk.

Page 5: Cis520 group e

6. Hadoop Cluster Deployment

The selected Apache Hadoop distribution is an Azure HDInsight cluster. It is deployed from the Azure portal and configured with a Windows operating system and Hadoop version 2.6.0. Its resources comprise 4 worker nodes with a total of 16 cores, 14 GB RAM and 8 disks. The head node is set up with the same specifications.

Page 6: Cis520 group e

7. Data Upload

The dataset is transferred using the CloudBerry Explorer client application. It is uploaded directly into the default container of the blob storage account linked to the HDInsight cluster.

Page 7: Cis520 group e

8. Querying the Data

The queries are written in HiveQL and run from Azure’s Hive Editor. Hive executes them as MapReduce jobs.

a. Table Creation: A CREATE TABLE query is used to create the table with the appropriate columns.
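A minimal sketch of what such a statement could look like in HiveQL. The table name `employees` and every column name are hypothetical, because the original query was not preserved in this transcript. Only a handful of the 20 data elements are shown, and a real definition would need to declare all 20 in the order they appear in the files.

```sql
-- Hypothetical table and column names, for illustration only.
-- length_of_service is assumed to be stored as whole years.
-- The row format matches the dataset description: comma-delimited
-- columns, one record per line.
CREATE TABLE IF NOT EXISTS employees (
  agency_name        STRING,
  gender             STRING,
  length_of_service  INT,
  salary_level       STRING,
  education_level    INT,
  supervisory_status INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

Declaring numeric fields as INT rather than STRING lets later queries use range comparisons directly, at the cost of NULL values wherever a field cannot be parsed as a number.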

Page 8: Cis520 group e

b. Data Loading: LOAD DATA INPATH queries are used to load the uploaded files into the table.
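As an illustration, assume the uploaded files sit under a hypothetical `/hrdata/` folder in the cluster's default container and the target table is named `employees`. Both names are assumptions, not taken from the original slides. Under those assumptions, the load step might look like this:

```sql
-- Hypothetical path and table name.
-- Pointing INPATH at a directory loads every file inside it.
LOAD DATA INPATH '/hrdata/' INTO TABLE employees;
```

LOAD DATA INPATH moves the files from the source path into the table's storage location rather than copying them, so the source folder is empty afterwards.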

Page 9: Cis520 group e

c. Validation: SELECT queries are used to validate the information loaded into the table.
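A validation pass of this kind could be sketched as follows. The table name `employees` is hypothetical, and these are not the queries from the original slides.

```sql
-- Preview a few rows to check that the columns line up with the file layout.
SELECT * FROM employees LIMIT 10;

-- Count the loaded rows to compare against the record count of the source files.
SELECT COUNT(*) AS loaded_rows FROM employees;
```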

Page 10: Cis520 group e

9. Data Refinement

The refinement of the data is based on the identified business requirements. The filtering is performed with the COUNT function and the WHERE and GROUP BY clauses, using a variety of conditions.

a. COUNT

Page 11: Cis520 group e

b. WHERE

Page 12: Cis520 group e

c. WHERE

Page 13: Cis520 group e

d. GROUP BY

Page 14: Cis520 group e

e. GROUP BY and WHERE
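The refinement steps listed above can be sketched in HiveQL as follows. The original queries were not preserved, so the table name `employees` and the column names `salary_level`, `agency_name`, `education_level` and `supervisory_status` are hypothetical, and the conditions are only examples.

```sql
-- COUNT: total number of employee records.
SELECT COUNT(*) AS total_employees FROM employees;

-- WHERE: keep only the records that satisfy a condition.
SELECT * FROM employees WHERE salary_level = 'D';

-- GROUP BY: one aggregated row per agency.
SELECT agency_name, COUNT(*) AS headcount
FROM employees
GROUP BY agency_name;

-- GROUP BY and WHERE: filter first, then aggregate the remaining rows.
-- Supervisory levels 2, 6 and 7 are the ones named in the interpretation section.
SELECT education_level, COUNT(*) AS supervisors
FROM employees
WHERE supervisory_status IN (2, 6, 7)
GROUP BY education_level;
```

In the last query, WHERE is applied before the grouping, so only supervisors are counted within each education level.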

Page 15: Cis520 group e

10. Data Visualization and Interpretation

The data supports the following conclusions:

For demographic analysis, the first query displays all the female employees working in the Department of Defense, Defense Contract Audit Agency.

Page 16: Cis520 group e

The second query addresses HR professionals' need for position-vacancy analysis by searching for employees with a Length of Service of 30 years or more, since they are the most likely to retire.
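The two queries described in this section might be written roughly as follows. This is a sketch under assumptions: the table name `employees`, the column names, the gender code 'F' and the exact spelling of the agency value are all hypothetical, and length of service is assumed to be stored as a number of years.

```sql
-- First query: female employees in the Defense Contract Audit Agency.
-- The agency string below is a placeholder for however the dataset spells it.
SELECT *
FROM employees
WHERE gender = 'F'
  AND agency_name = 'DEPARTMENT OF DEFENSE-DEFENSE CONTRACT AUDIT AGENCY';

-- Second query: employees with 30 or more years of service.
SELECT *
FROM employees
WHERE length_of_service >= 30;
```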

Page 17: Cis520 group e

a. Graph 3.10.1 shows that salary expense is concentrated in salary levels D, E and G, each of which accounts for 12% of the total. This means that 36% of the total salary expense is allocated to employees who make $40,000 - 59,999 or $70,000 - 79,999. Salary level F, with a range of $60,000 - 69,999, follows the top three at 10% of the total salary expense.

Page 18: Cis520 group e

b. Graph 3.10.2 shows that the largest share of personnel with a supervisory level of 2 (Manager), 6 (Leader) or 7 (Team Leader) have an education level of 13, a bachelor’s degree. More than 80,000 supervisors share this education level, followed by level 4, a high school diploma, which is shared by 64,700 supervisors. Together, these top two education levels account for 49% of the supervisors in this group.

Page 19: Cis520 group e

11. Conclusion

a. System Review - The system provides an effective and user-friendly platform for manipulating and enhancing large amounts of HR data, and it was created in an efficient and cost-effective manner.

b. Opportunities - Globalization, advancements in technology and even the growing population in general will mean more jobs and more people to manage in the future. As Brittany Hink, Editor in Chief of Human Resources IQ, says, “without analytics, corporations could face an increase in skills gaps throughout the entire company, less engaged employees, a lack of internal development, along with many other challenges” [2]. These factors and trends will result in a continuous explosion of data that will be important to tap into and interpret just to perform daily human resources management operations. In summary, human resources is an aspect of business and human behavior on which Big Data analytics should focus more.

12. Reflection

Our team learned a great deal about big data and the impact that human resources departments have on global business as a whole. More importantly, we learned how critical the System Development Lifecycle (SDLC) can be, as we encountered related difficulties in implementation and testing. We learned to respect the SDLC process, as it is a proven framework for establishing systems that really work.

Page 20: Cis520 group e

Thank you!