1 Introduction

Human resource (HR) analytics is a branch of analytics in which the analytical process is applied to an organization’s human resources. The main objective of this process is not only to enhance employee performance but also to improve overall employee satisfaction. HR analytics collects data and transforms it into useful information that provides insights for, and empowers, human resource-related processes. It is very useful for improving HR functions such as workforce management, recruitment, performance evaluation, and development management. This chapter focuses on four HR problems and attempts to solve them by using HR analytics tools. The rest of the chapter is organized as follows. The following sections present the four HR problems, namely attrition risk, recruitment, performance evaluation, and training planning. Each problem is defined via a comprehensive literature review. For each problem, a case study is defined and HR analytics tools are applied. The last section concludes and gives further suggestions.

2 Employee Turnover (Attrition)

Employee turnover (i.e., attrition) refers to the number (or ratio) of employees leaving the organization either voluntarily or involuntarily. Voluntary turnover is based on the decision of employees, while involuntary turnover occurs when an employer decides to end the employment relationship or as a result of the death or retirement of the employee [1, 2]. From an organization’s perspective, performance below expectations, improper behavior, or problems adapting to the culture or working environment may cause the termination of the employment agreement [1, 3]. Since organizations have little control over employee-related involuntary turnover, this study examines voluntary turnover.

Starting from recruitment, organizations invest in each employee through selection, orientation, training, improvement, and positioning. When an employee leaves the organization, the position should be filled immediately with an appropriate candidate. Accordingly, losing an employee incurs a significant cost and also affects the organizational value [4], the long-term strategies [5], the competitive advantage [1, 6], and the organizational performance [7, 8]; it may even lead to a loss of customers, especially if the departing employee is talented and highly capable. As an important factor affecting the organization’s overall performance, employee turnover deserves considerable attention from both organizations and researchers so that precautions can be taken to diminish (or manage) employee loss. Thanks to developments in learning-based data analysis techniques, employee data representing the reasons for voluntary attrition can be examined, and whether an employee will quit can be predicted. The common approach in the recent literature on predicting employee turnover is to apply different methods or algorithms and compare them using performance measures such as accuracy, precision, recall, specificity, the area under the curve (AUC), and F-measure. Table 1 lists some of these studies along with the performance measures and the datasets utilized.

Table 1 Learning-based methods to predict employee attrition

According to a detailed literature review, gender [1, 6, 8, 9, 10], state of origin [1], duration of service [1, 4, 8, 9, 10], the title in the organization or job level [1, 6, 8, 9], annual or monthly salary [1, 4, 8, 9], job satisfaction [4, 6, 8, 9, 10], environment satisfaction [6, 8, 9, 10], relationship satisfaction [8, 9], performance of the employee [4, 6, 8, 9], work–life balance [6, 8, 9, 10], the number of trainings the employee participated in during the previous year [8, 9], promotion [4], and feeling appreciated [10] are among the factors considered for the prediction. Even though several algorithms and methods including support vector machines [4, 5, 8, 9, 11, 12, 13], AdaBoost [4, 8], neural networks [13], KNN [8, 9, 11, 12], and naïve Bayes [5, 8, 9, 11, 12] are used, the methods preferred on the basis of the performance measures are decision tree [1, 4, 8, 9, 11, 12, 13], random forest [4, 5, 8, 9, 10, 12, 13], and XGBoost [5, 11, 13, 14].
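The performance measures used to compare these methods can all be derived from the confusion matrix of a binary classifier. The following is a minimal sketch, assuming that "employee leaves" is the positive class; the function name and the counts are hypothetical and are not taken from any of the cited studies. AUC is omitted because it requires predicted scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification measures from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "employee leaves" as the positive class.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 200 employees, 30 of whom actually left.
metrics = classification_metrics(tp=12, fp=8, fn=18, tn=162)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the accuracy is high even though fewer than half of the leavers are identified, which is why recall and F-measure are usually reported alongside accuracy for attrition data.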

To achieve a well-managed prevention process that diminishes employee attrition, organizations should consider the most important factors affecting employees’ decisions to quit. According to recent studies, relatively high attrition is mostly related to lower involvement [10], motivation, and satisfaction of employees [9]; thus, improving the working environment, balancing the workload, and sustaining a strong relationship between the employee and management may help to deal with this problem [6, 10]. Using existing employee data to extract valuable information may help organizations to enhance their decision-making capabilities regarding employment strategies.

Case Study: Employee Turnover (Attrition)

In this use case, a machine learning model was built to predict employee attrition by using an employee dataset. The dataset is a modified version of IBM HR Analytics Employee Attrition Data [50]. The main research questions are: what is the attrition risk of the current employees, and what are the factors that affect employee attrition?

The dataset includes 20 variables of 1102 employees in a company. The list of variables and their definitions can be seen in Table 2.

Table 2 List of variables_employee attrition

For the next step, unnecessary variables were removed, and a brief exploratory data analysis was conducted. Since the main goal of this analytics model was to predict the attrition risk of employees and identify the reasons behind attrition, a bivariate analysis was carried out to understand the distribution of attrition. To conduct the bivariate analysis, the data types of the features were modified and checked. The dataset included 4 categorical, 16 numeric, and 1 string features. Since the feature “EmpID” represents the unique ID of each employee, it was removed from the analysis. Additionally, each column was checked for null values, and the dataset did not include any. The attrition ratios of former and current employees were compared for each categorical variable by using bivariate analysis. The results of this analysis are as follows:

  • For education level  =  1, the attrition ratio (17.3%) is slightly higher than the overall ratio (15.6%).

  • For education level  =  5, the attrition ratio (7.4%) is lower than the overall ratio (15.6%).

  • Employees whose wages are below the company mean (Wage rate  =  1) have a slightly higher attrition ratio (17.9%) than the others.

  • Employees who work remotely (Working Model  =  2) have a higher attrition ratio (25.4%) than the other groups.

  • Employees working in the office (Working Model  =  1) have the lowest attrition ratio of the three working models.

  • Employees from departments “0” and “2” have slightly higher attrition ratios than the overall ratio.
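A bivariate comparison of this kind reduces to computing the share of leavers within each level of a categorical variable and setting it against the overall share. The sketch below assumes a binary "Attrition" column (1 for employees who left); the column name "WorkingModel", the function name, and all records are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict


def attrition_rate_by_level(records, feature, target="Attrition"):
    """Return {level: share of employees with target == 1} for one feature."""
    counts = defaultdict(lambda: [0, 0])  # level -> [leavers, headcount]
    for row in records:
        counts[row[feature]][0] += row[target]
        counts[row[feature]][1] += 1
    return {level: left / n for level, (left, n) in sorted(counts.items())}


# Hypothetical toy records; the values are illustrative only.
records = [
    {"WorkingModel": 1, "Attrition": 0},
    {"WorkingModel": 1, "Attrition": 0},
    {"WorkingModel": 1, "Attrition": 0},
    {"WorkingModel": 1, "Attrition": 1},
    {"WorkingModel": 2, "Attrition": 1},
    {"WorkingModel": 2, "Attrition": 0},
    {"WorkingModel": 3, "Attrition": 0},
    {"WorkingModel": 3, "Attrition": 0},
]

overall = sum(r["Attrition"] for r in records) / len(records)
rates = attrition_rate_by_level(records, "WorkingModel")
for level, rate in rates.items():
    print(f"WorkingModel={level}: {rate:.1%} (overall {overall:.1%})")
```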

The exploratory data analysis suggests that all of the features within the dataset could be used for modeling. There are signs of relationships between the target column “Attrition” and independent variables such as “Age,” “Monthly Income,” “Years in current role,” “Education Level,” and “Working Model,” which makes them possible predictors of employee attrition. However, since these comparisons were conducted on each variable separately, the outputs of the model should be analyzed to understand the more complex relationships.

For the next step, a classification model was applied to predict the attrition risk of current employees. All of the variables except “EmpID” were used as predictors, and the variable “Attrition” was used as the target. Three different machine learning algorithms were applied to the Attrition dataset: decision tree, random forest, and LightGBM. Since the data is imbalanced, with an attrition rate of 15.6%, checking both accuracy and F1 scores is crucial for assessing model performance. Even though the decision tree is a simpler model than random forest and LightGBM, it achieved the highest recall (0.34) and F1 score (0.32); however, its overall accuracy (0.76) is worse than that of the other two. The random forest model has a higher accuracy than the decision tree, but its recall is considerably low (0.09) and its F1 score is 0.16. LightGBM has a better F1 score (0.27) and a higher recall (0.18) than random forest, and it has the highest overall accuracy (0.84). As a result, the LightGBM algorithm, which has the highest accuracy and better recall and F1 scores than random forest, was chosen to predict the employee attrition risk.
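The role of class imbalance in reading these scores can be illustrated with simple arithmetic on the figures reported in this case study (1102 employees, 15.6% attrition). The sketch below computes the accuracy of a trivial model that predicts "stays" for every employee; it assumes the full dataset, so a held-out evaluation split may have a slightly different class balance.

```python
n_employees = 1102        # dataset size reported in the case study
attrition_rate = 0.156    # overall attrition ratio reported in the case study

# Approximate class counts implied by the reported rate.
n_leavers = round(n_employees * attrition_rate)
n_stayers = n_employees - n_leavers

# A trivial baseline that predicts "stays" for everyone.
baseline_accuracy = n_stayers / n_employees
baseline_recall = 0 / n_leavers   # it never identifies a single leaver

print(f"leavers: {n_leavers}, stayers: {n_stayers}")
print(f"baseline accuracy: {baseline_accuracy:.3f}, recall: {baseline_recall:.2f}")
```

Under this assumption the baseline already reaches an accuracy of roughly 0.84 with zero recall, so accuracy values in that range should be interpreted together with recall and F1 rather than on their own.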

For the next step, the feature importance of the LightGBM model was calculated. The feature importance plot is given in Fig. 1.

Fig. 1 Feature importance_employee attrition

Figure 1 shows that “Distance from Home,” “DailyRate,” and “MonthlyIncome” are the three most important features for the classification model.

A deep-dive analysis showed that the attrition risk of employees increases as DailyRate decreases. Lower monthly income also increases the risk of attrition, whereas a shorter distance from home decreases it.

3 Recruitment Analytics

Organizations aim to hire employees who have the appropriate knowledge, skills, and abilities to contribute to success and sustain competitive advantage. Ensuring the right fit for vacant positions is within the scope of the human resources recruitment function.

Recruitment, consisting of the activities performed to find and attract qualified employees, affects organizational performance through the success of the subsequent HR functions such as selection, compensation, and training [15]. As a data-driven and knowledge-sensitive process, recruiting requires considerable knowledge and expertise to attract job seekers to apply to job openings, maintain their interest in a position for as long as it is open, review the curricula vitae (CVs) of the candidates, and match the requirements of the company with potential applicants [16, 17]. Because the process involves a huge amount of data, automating it is among the primary concerns of both practitioners and researchers. By turning the existing data into valuable business intelligence, recommender systems become appealing both to recruiters and to job seekers.

Job recommender systems are developed to offer job seekers positions that attract their attention. For instance, to match job seekers with available job openings, a career platform classifies both the CVs of the registered users and the available job titles using support vector machines and k-nearest neighbor methods [18]. A job recommendation system consisting of random forest and support vector machines is proposed to target price-sensitive and location-sensitive people separately [19].

To support recruiters, an automated recruitment system that imitates the knowledge of the recruiter, extracts data from CVs, and matches requirements with profiles is built [17]. In a study proposing a new data-based design for electronic human resources management (e-HRM) activities including e-recruitment, e-selection, e-performance management, e-compensation, and e-learning, attrition is used to predict potential positions to be input to the e-recruitment system using decision tree [classification and regression trees (CART)], support vector machines, and k-nearest neighbor methods [20]. The decision tree performs better than the other methods. The effectiveness and usability of these systems are quite important for their prevalence. Accordingly, information quality, popularity, and security of the source are found to be the factors affecting the effectiveness of e-recruitment [21].

Recruitment sources should be examined carefully due to their effect on employee performance, employee turnover [22], or employee loyalty [23]. Even though there is no consensus on which sources result in the recruitment of better-performing employees, several studies are conducted to understand the nature of the relationship. The primary reasons for the differences in the prominent sources of recruitment can be summarized as follows: (i) using a different number of sources [24] and even combining them into different groups (e.g., internal and external [25], being controllable or uncontrollable by the organization [16]), (ii) assuming that the job seekers and/or recruiters use a single source and neglecting the usage of multiple sources [24], (iii) the differences in the employee demographics or characteristics, (iv) the differences in the type of jobs (i.e., white/blue-collar positions, permanent/temporary jobs, etc.) [24], (v) the differences in the type of organization and industry (i.e., local/global [26]), and (vi) the differences in sample sizes [24].

An organization might fill existing or expected vacancies internally by recruiting its own employees or externally [16]. Interdepartmental transfers, promotions, and internship programs are considered internal sources, while external sources of recruiting include recommendations, employment agencies, universities and colleges, referral programs, job fairs, professional associations and trade unions, rehiring former employees, and advertisements, as well as employee exchange [24, 25, 26].

Performance-related studies indicate that internal sources have a significant effect on job performance [25], employees hired using employee referral have higher job performance than those hired through other methods [27], and agencies do better in finding the appropriate employees [28].

Case Study: Assessing Recruitment Source Through Employee Performance and Engagement

In this case, an analytics approach was applied to an HR dataset to assess the best medium for hiring. The dataset used in this case is a modified version of IBM HR Analytics Employee Attrition Data [50]. The main objective is to assess recruitment sources through employee performance and engagement scores and define which recruitment source is better to ensure employee performance.

The dataset includes Performance Score, Organizational Engagement Score, Positional Engagement Score, and source of hiring for 852 employees in a company. A sample dataset is given in Table 3.

Table 3 Sample dataset recruitment

A brief statistical analysis of the numerical variables showed that the performance score ranges between 0 and 100, whereas the engagement scores range between 0 and 5. The correlations between the numerical features show that the Organizational and Positional Engagement scores are highly correlated (0.83). Additionally, the Performance score is correlated with Positional Engagement, with a correlation coefficient of 0.65, but less correlated with Organizational Engagement, with a coefficient of 0.54.

To select the best hiring source, a target variable is defined using the engagement and performance scores. Since organizational engagement is related to employee commitment to the company and positional engagement is related to the employee’s willingness to stay in the current role, both of them are linked with long-term individual performance. By using these three indicators, the employee data is segmented, and every employee is assigned to a segment that represents their engagement and performance scores. The segmentation is conducted by applying an unsupervised machine learning algorithm, and in the next step, the hiring sources are evaluated according to the resulting segments.

By using the K-means clustering algorithm and the elbow method, the optimum number of employee segments is determined as 4 (Fig. 2).

Fig. 2 K-means analysis-recruitment analysis
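The elbow method inspects how the within-cluster sum of squares (inertia) falls as the number of clusters grows and looks for the point after which the gain flattens. The following is a minimal standard-library sketch of K-means (Lloyd's algorithm) and the inertia curve. The data are hypothetical triples of performance and engagement scores, the rescaling to [0, 1] and the farthest-point initialisation are assumptions made for this illustration, and the number of groups in the toy data is not the result reported in this case study.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples.

    Returns (centroids, labels, inertia), where inertia is the
    within-cluster sum of squared distances.
    """
    # Deterministic farthest-point initialisation, chosen here only for
    # reproducibility; the chapter does not state how centroids were seeded.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def assign(cents):
        # Index of the nearest centroid for every point.
        return [min(range(len(cents)), key=lambda j: math.dist(p, cents[j]))
                for p in points]

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:   # assignments no longer change
            break
        centroids = updated

    labels = assign(centroids)
    inertia = sum(math.dist(p, centroids[lab]) ** 2
                  for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical (performance 0-100, organizational engagement 0-5,
# positional engagement 0-5) triples; not taken from the chapter's dataset.
raw = [
    (20, 1.0, 1.0), (25, 1.5, 1.0), (30, 1.0, 1.5),
    (50, 2.5, 2.5), (55, 3.0, 2.5), (60, 2.5, 3.0),
    (85, 4.5, 4.5), (90, 4.0, 4.5), (95, 4.5, 5.0),
]
# Rescale every feature to [0, 1] so performance does not dominate distances.
points = [(perf / 100, org / 5, pos / 5) for perf, org, pos in raw]

inertias = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 4))

_, labels, _ = kmeans(points, 3)
```

In practice the printed inertia values would be plotted against k, and the elbow would be read from that curve.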

After the cluster analysis, the employee groups are defined as follows: employees with low engagement and performance as “Low,” employees with mid-level performance and mid-level engagement as “Medium,” and employees with high performance and high engagement as “High.” Since engagement is an important factor for sustainable performance, it is possible to assert that a hiring source that provides more employees in the “High” class is a better method for recruiting engaged, high-performing employees. The distribution of hiring sources and the employee clusters is given in Table 4.

Table 4 The distribution of hire source and the clusters

When the share of high performers is compared across hiring sources, “Search Firm” has the highest high-performer ratio (34.4%), which is above the overall average (27.5%). On the other hand, “Referrals” has the lowest high-performer ratio (22.5%), which is below the overall average. As a result, the clustering approach based on the performance and engagement scores of employees shows that “Search Firm” is the best hiring source to ensure long-term success within the company.
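The comparison above amounts to a cross-tabulation of hiring source against segment, followed by the share of the “High” segment within each source. A minimal sketch is given below; the rows, the “Job Board” label, and the function name are hypothetical and do not reproduce the values in Table 4.

```python
from collections import Counter


def high_share_by_source(rows):
    """rows: list of (hiring_source, segment) pairs.

    Returns (share of 'High' per source, overall share of 'High').
    """
    totals = Counter(source for source, _ in rows)
    highs = Counter(source for source, segment in rows if segment == "High")
    per_source = {source: highs[source] / totals[source] for source in totals}
    overall = sum(highs.values()) / sum(totals.values())
    return per_source, overall


# Hypothetical assignments of employees to segments, for illustration only.
rows = [
    ("Search Firm", "High"), ("Search Firm", "High"),
    ("Search Firm", "Low"), ("Search Firm", "Medium"),
    ("Referrals", "High"), ("Referrals", "Medium"),
    ("Referrals", "Low"), ("Referrals", "Low"),
    ("Job Board", "Medium"), ("Job Board", "Low"), ("Job Board", "High"),
]

per_source, overall = high_share_by_source(rows)
best = max(per_source, key=per_source.get)
for source, share in per_source.items():
    print(f"{source}: {share:.1%} high performers (overall {overall:.1%})")
```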

4 Performance Analytics

Performance management, an approach for enhancing the overall targets of the organization by improving the performance of the employees (both individuals and teams) to benefit from their potential, ensures that the employees embrace the values and the goals of the organization [29]. An effective performance management approach requires measuring employee performance using appropriate tools and techniques to provide input to other HR functions such as career planning and training, develop strategies, or take necessary actions related to possible risks.

Employee performance directly affects the success, survival, and competitive power of organizations. So, it is expected that performance assessments support managerial decisions by providing the necessary information. Considering the dynamic and rapidly changing nature of the business environment, the costly and time-consuming traditional performance assessment tools and techniques may fail to provide instant real-time information to support the decisions [30, 31]. At this point, data analytics methods and algorithms are used to observe the actual performance and even to take precautions against likely future deteriorations based on continuous predictions [32].

In order to predict employee performance, stochastic gradient descent and bagging classifier [33], neural networks [34], decision tree [33, 35, 36, 37], naïve Bayes [34, 36, 38], logistic regression [33, 34, 36, 39], random forest [33, 34, 39], and support vector machines [34, 39] are used. As given in Table 5, classification methods and algorithms are preferred. As the employee characteristics and the nature of the work have an impact on performance, these studies consider various factors including gender, age, education, marital status, industry, organizational support, total experience in years, title of the position (e.g., managerial or non-managerial), leadership type, nature of the task (e.g., complexity), socio-economic status, location, income group, wage system, workspace and environment, motivation, and competence [34, 35, 36, 37]. Personality-related factors such as the abilities of creative thinking, conflict resolution, decision-making, and personal relationships are also included in the models [34].

Table 5 Methods to predict employee performance

With the rise of information technologies, digitalization, and other related technologies [31], performance management activities are evolving to become more artificial intelligence (AI)-oriented. Using AI is expected to result in continuous, real-time, flawless data with a robust and unbiased performance management process [40].

Case Study: Who are the best performers? Identifying top manager profile through analytics

In this case, an analytics approach was applied to assess the profile of top-performer managers using an employee dataset. The dataset is a modified version of IBM HR Analytics Employee Attrition Data [50]. The objective of this case is to identify the top performer managers and their common characteristics.

The dataset includes 8 variables of 218 managers in a company. The list of variables and their definitions can be seen in Table 6.

Table 6 List of variables and their definitions

To identify top performer managers within the company, evaluation scores of subordinates and business performance outcomes were analyzed. Basic descriptive statistics show that TeamAssessment has a mean of 2.24 and PerformanceScore has a mean of 54.2. The correlation coefficient between TeamAssessment and PerformanceScore is −0.0089, which indicates that these two variables are not correlated. To identify the top performer managers, firstly, managers who have higher performance scores than the company average were identified. Secondly, among these, managers whose team assessment scores were higher than the company average were selected.
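The two-step selection described above can be sketched as a filter on both scores relative to their company averages. In the sketch below, the definition of low performers (below average on both scores) is an assumption, since the text only states the rule for top performers; the manager records and the function name are hypothetical.

```python
from statistics import mean


def split_managers(managers):
    """Return (top, low) lists of manager ids.

    top: above the company average on both scores (as described in the text).
    low: below the company average on both scores (an assumption).
    """
    avg_perf = mean(m["PerformanceScore"] for m in managers)
    avg_team = mean(m["TeamAssessment"] for m in managers)
    top = [m["id"] for m in managers
           if m["PerformanceScore"] > avg_perf and m["TeamAssessment"] > avg_team]
    low = [m["id"] for m in managers
           if m["PerformanceScore"] < avg_perf and m["TeamAssessment"] < avg_team]
    return top, low


# Hypothetical managers, for illustration only.
managers = [
    {"id": "A", "PerformanceScore": 70, "TeamAssessment": 3.0},
    {"id": "B", "PerformanceScore": 60, "TeamAssessment": 2.0},
    {"id": "C", "PerformanceScore": 40, "TeamAssessment": 2.5},
    {"id": "D", "PerformanceScore": 30, "TeamAssessment": 1.5},
]

top, low = split_managers(managers)
print("top performers:", top)
print("low performers:", low)
```

Managers who are above average on only one of the two scores (B and C here) fall into neither group under this rule.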

The analysis results showed that there are 60 top-performer and 55 low-performer managers. A comparison of basic descriptive statistics shows that the average age of top performers is lower than that of low performers (37.3 versus 41.3). In addition, top performers have a higher average total experience and a lower tenure than low performers. Figure 3 illustrates the tenure density plot for top and low performers.

Fig. 3 The tenure density plot for top and low performers

The density plot shows that the number of top performers with a tenure of less than 5 years is higher than the number of low performers. On the contrary, the number of low performers with a tenure of more than 5 years is higher than the number of top performers. Even though the distributions of top and low performers in terms of total experience are similar, more top performers have a total experience of more than 15 years. When the education levels of top and low performers are compared, the number of top performers who have a Master’s degree is higher than the number of low performers. On the contrary, the number of low performers who have a high school degree is higher than the number of top performers.

Analysis of the education field marks a considerable difference between technical degree and life sciences graduates. Figure 4 shows the performance difference between education degrees.

Fig. 4 The performance difference between education degrees

The number of top performers with a technical degree is much higher than the number of low performers with one. On the contrary, the number of low performers with life sciences degrees is much higher than the number of top performers. When the numbers of top and low performers are compared across organizational departments, the number of top performers is slightly higher in information technologies and lower in operations. In summary, the common characteristics of the top performers are that they are below 40, have a tenure of less than 5 years, have a total experience of more than 10 years, and have a technical degree in IT and marketing in operations.

5 Training Analytics

Training and development have a significant role in the professional and individual improvement of employees in terms of knowledge, skill, and attitude [42, 43]. As a long-term process aiming to enhance employee performance considering the possible requirements of the job which may arise in the future, the development covers strategic level improvements [43].

Effective training and development support the motivation [44] and self-fulfillment of employees and in turn, reduce employee turnover, ensure successful and sustainable employee retention strategies [45], and contribute to the performance, growth, and competitive power of organizations [46].

Similar to other HR functions, data-based methods and algorithms are used in examining the training and development activities. To automatize the assessment of employees’ learning types, the determination of training materials and the development of training strategies, an artificial intelligence-based expert system consisting of a rule-based approach and association rule mining is proposed [47]. This expert system operates using employee information such as department, seniority, and position.

Classifying employees based on their knowledge and expertise also helps HR to determine the training needs. K-means clustering is used to classify the expertise of employees (four classes from excellent to poor), and four classification algorithms, namely AdaBoost, random forest, support vector machines, and logistic regression, are used to identify the development potentials; the random forest performs best, with 0.80 accuracy [48]. In another study, employees are classified according to whether or not they should take part in personalized training using decision tree, random forest, and support vector machines; again the random forest performs best, with 0.963 accuracy [43]. That study uses a dataset consisting of variables including department, length of service, the number of previous trainings, average training score, and awards.

Besides the department and the industry in which organizations operate [43, 46], the tools of training are considered in the studies examining the relationship between training and development and employee performance. On-the-job training, coaching, mentoring, and the use of development centers or project teamwork may affect the performance of employees or organizations [46]. Table 7 summarizes development management studies.

Table 7 Training development management-related studies

Case Study: Data-driven training and development plan through analytics

In this case, an analytics approach was conducted to identify the best training method for the employee to increase performance. The dataset is a modified version of IBM HR Analytics Employee Attrition Data [50]. The main questions of the study are: “Which training method is more effective than others in terms of employee performance? Are there any differences between different departments in terms of training method and performance? Can we predict employee performance through the number of classes and online training?”

The dataset includes performance rating, department information, and training types of 1102 employees of a company. In this case, the main question is to understand which training type is more effective in terms of employee performance. There are multiple factors to evaluate the effectiveness of training. Performance Rating is the observed employee performance. It is the most tangible output of training investments. For this reason, the effect of training on employee performance is explored. Since the training needs of different departments could vary, the relationship between employee performance and training type will be explored on the department level. The list of variables and their definitions can be seen in Table 8.

Table 8 List of variables_training and development plan

In the dataset, there are 43 employees from finance, 709 employees from IT, and 350 employees from sales. Correlation analysis shows relatively high correlations between performance rating and the training types. “Online training weeks” is positively correlated with performance rating, with a Pearson coefficient of 0.44. “In Class Trainings” is positively correlated with performance rating, with a Pearson coefficient of 0.33.

Boxplot analysis of the Class training variable shows that the distributions of the three departments are identical. Secondly, boxplot analysis of Performance Rating, given in Fig. 5, shows that IT has a higher average score than Finance and Sales, and that Finance has a very narrow distribution with low variance. The standard deviation for Finance is 0.09, whereas it is 0.52 for IT and 0.86 for Sales.

Fig. 5 Boxplot analysis of performance rating

Exploratory data analysis showed that organizational function is an indicator of performance rating, with IT having higher values. To understand the relationship between performance rating and online/class trainings, the relationship between the target variable (performance) and the other two numeric variables is first examined, and the analysis is then drilled down to the organizational department level. When the relationship between performance rating and online trainings is explored, a very strong correlation appears in the IT department. The correlation table at the organizational department level shows that performance scores and online trainings are strongly positively correlated in the IT department (0.84). The correlation table between performance scores and class trainings shows that these two variables are strongly positively correlated in the Sales department. Since the aim is to explore the effect of online and class trainings on employee performance, the high correlations in Sales and IT provide a baseline for further analysis. Even though the performance of employees within a company could be related to many factors, the high correlations in Sales and IT suggest that the number of class trainings could be a good indicator of performance in the Sales department and that the number of online trainings an employee receives could be a good indicator of performance in the IT department. In the next step, a machine learning algorithm is applied to predict employee performance through the available independent variables: Organizational Function, Online training hours, and the number of class trainings.
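Drilling down to the department level means computing the Pearson coefficient separately within each department instead of once over the pooled data. The sketch below illustrates this with hypothetical records; the column names "Department", "OnlineTraining", and "PerformanceRating" and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")   # undefined when one variable is constant
    return cov / (sd_x * sd_y)


def correlation_by_group(rows, x_key, y_key, group_key="Department"):
    """Return {group: Pearson r between x_key and y_key within that group}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append((row[x_key], row[y_key]))
    return {g: pearson(*zip(*pairs)) for g, pairs in groups.items()}


# Hypothetical records, for illustration only.
it_perf = [2.0, 3.0, 3.5, 5.0]
sales_perf = [3.0, 2.5, 3.5, 3.0]
rows = (
    [{"Department": "IT", "OnlineTraining": i + 1, "PerformanceRating": p}
     for i, p in enumerate(it_perf)]
    + [{"Department": "Sales", "OnlineTraining": i + 1, "PerformanceRating": p}
       for i, p in enumerate(sales_perf)]
)

by_dept = correlation_by_group(rows, "OnlineTraining", "PerformanceRating")
for dept, r in by_dept.items():
    print(f"{dept}: r = {r:.2f}")
```

A department with almost no variation in the target, such as Finance in this case study, yields an unstable or undefined coefficient, which is why the guard for zero variance is included.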

To predict employee performance through online and class trainings, a LightGBM regressor model was built. The results show that the model can explain 78% of the variance of the target variable (r2_score: 0.78), which is a very good score. The root mean squared error of the model is 0.32, which is also a good error value when compared with the mean (3.1) and standard deviation (0.71) of the target variable, performance score. Lastly, SHAP values of the model were calculated to explain the behavior of the LightGBM model (Fig. 6).
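The two regression scores reported above can be computed directly from observed and predicted values. The following sketch defines both measures and also checks, as a rough consistency test, the coefficient of determination implied by the reported error (0.32) and standard deviation (0.71). The toy values are hypothetical, and the implied figure assumes that both statistics were measured on the same data, which the text does not state.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical observed and predicted performance ratings.
y_true = [2.0, 3.0, 3.0, 4.0]
y_pred = [2.5, 3.0, 2.5, 4.0]
toy_r2 = r2_score(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
print(f"toy R2 = {toy_r2:.2f}, toy RMSE = {toy_rmse:.3f}")

# R2 implied by the reported error and standard deviation of the target.
implied_r2 = 1 - (0.32 / 0.71) ** 2
print(f"implied R2 = {implied_r2:.3f}")
```

The implied value of about 0.80 is close to the reported 0.78, so the two reported figures are broadly consistent with each other.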

Fig. 6 SHAP values_training and development management

The results of the predictive model show that it is possible to predict employee performance through online and class trainings and organizational department information. The SHAP values show that higher values of Online trainings increase the predicted performance score. The exploratory analysis indicates that this relationship arises because Online trainings are highly correlated with performance score in the IT department.

This chapter highlighted the importance of HR analytics and showed how HR problems can be solved by using HR analytics tools. HR analytics can be defined as collecting data and transforming it into useful information for improving human resource management. It is very useful for organizations since it creates useful insights for managing the organization and helps them to reach their strategic goals. HR analytics is especially useful for improving HR functions such as workforce management, recruitment, performance evaluation, and development management. In this chapter, four different HR problems were handled and addressed by using HR analytics tools. For future studies, HR analytics tools can be applied to various other HR problems.