1 Introduction

With the rapid growth of industrialization, firms are chasing profit and ignoring environmental pollution. Corporate activities have infiltrated the natural environment’s equilibrium, triggering global warming, climate change, and harmful waste generation (Crace & Gehman, 2022; Dhasmana et al., 2023). The world has now begun to pay attention to sustainable development and the use of available resources without sacrificing the requirements of future generations. Since poor environmental quality can be a significant obstacle to economic development, all stakeholders, including institutions, customers, and investors, should think about how they can become more sustainable (Chung et al., 2023; Nekhili et al., 2021). As a result, environmental, social, and corporate governance (ESG) has emerged as a tool to assess the extent to which a corporation works on behalf of societal purposes that go beyond the corporation’s function of maximizing profits on behalf of its shareholders (Billio et al., 2021; Kalaitzoglou et al., 2021; Liagkouras et al., 2020). ESG assesses a company’s efforts in terms of energy efficiency, greenhouse gas emissions, waste, water, and resource management.

ESG rating goals include not only environmental issues but also discrimination (such as gender and race), human rights issues (such as child labor), and information from smart sanctions lists issued by nations and global companies (D’Amato et al., 2021; D’Apice et al., 2021; Semenova & Hassel, 2019). In 2019, 300 mutual funds with ESG mandates collected $20 billion in net flows, four times the 2018 level, demonstrating investor interest in ESG (Sharma et al., 2020). Over 3000 institutional investors and service providers have joined the Principles of Responsible Investment (PRI), a commitment to include ESG factors in their investment analysis and decision-making. Financial institutions also use ESG ratings as a lending decision tool. Their aim is not only to keep their portfolios profitable by turning away companies with financial problems; there is also a rising interest in pressing publicly traded companies to improve their ESG practices (Hisano et al., 2020).

The theoretical underpinnings of this study are grounded in stakeholder theory, agency theory, legitimacy theory, and signaling theory to illuminate the complexities of firm, industry and country level factors and their relationship to ESG (Baldini et al., 2018; Chung et al., 2023; Weber, 2014). Stakeholder theory underscores the positive connection between larger firm size and voluntary disclosure, driven by the aspiration to maintain harmonious stakeholder relationships and minimize environmental and social impacts (Ansoff, 1965). Signaling theory underscores factors such as firm size, auditor type, profitability, gearing, and industry dynamics as key determinants of disclosure, because firms strategically use information dissemination to address information asymmetry with external parties (Spence, 1978). Conversely, agency theory emphasizes ownership structure, specifically the impact of dispersed ownership on the need for robust information disclosure to align management and shareholder interests (Jensen & Meckling, 1976). The convergence of institutional and stakeholder theory sheds light on broader institutional pressures that influence firms to adopt sustainability reporting practices. This convergence fosters a comprehensive understanding of the motivations behind ESG-related disclosure decisions (DiMaggio & Powell, 1983). Legitimacy theory explains that there is a relationship between firm size and a firm’s intention to disclose voluntary information because of public pressure (Perrow, 1970).

Previous studies have shown the effect of ESG on different aspects of the firm, e.g., ownership characteristics, leadership characteristics, systematic risk, credit risk, firm profitability, and stock return (Billio et al., 2021; Broadstock et al., 2020; Đặng et al., 2022; Darnall et al., 2022; Drempetic et al., 2020; Nekhili et al., 2021; Sauerwald & Su, 2019; Sharma et al., 2020). Little attention has been given to developing ESG rating prediction methodology and theory (D’Amato et al., 2021, 2022). Predicting an ESG score would benefit investors, the environment, and society in different ways. For example, an ESG prediction may advance the sustainable development of companies by identifying environmental issues, e.g., by detecting emission levels and energy consumption. Moreover, in line with arguments from the corporate social performance (CSP) literature, an ESG prediction will allow regulators to take the necessary action if firms’ ratings tend to fall. Investors and portfolio managers will be able to make investment decisions based on the rating. Financial institutions will be able to predict their borrowers’ ESG ratings and make lending decisions accordingly. Therefore, we aim to predict ESG ratings using firm and macroeconomic variables. Traditional econometric models, such as logistic regression and the generalized linear model, are typically applied to binary classification. Machine learning models, however, support multiclass classification, which is important in finance applications such as credit rating and bankruptcy prediction (Abdullah, 2021; Barbeito-Caamaño & Chalmeta, 2020; Jiang et al., 2022; Kumar et al., 2022; Wang et al., 2022). This study applies six machine learning algorithms to develop an ESG prediction model using data on 6166 firms from 73 countries and finds that the random forest classifier outperforms the other models with 78.50% accuracy.

This study makes several contributions. First, the study contributes to the growing body of literature by adding evidence on ESG rating forecasting. Earlier studies mostly focused on ESG rating forecasting with a small sample (D’Amato et al., 2021, 2022, 2023) and on ESG text classification (Lee & Kim, 2023; Sokolov et al., 2021). We fill the gap in ESG rating classification by using global data and a wide range of multiclass classification models. Second, our study extends the studies by D’Amato et al. (2021, 2022), who forecast ESG scores with a random forest regression model on time series data for 109 firms (D’Amato et al., 2021) and 401 firms (D’Amato et al., 2022) from the STOXX Europe 600 Index, covering 2014 to 2018 (D’Amato et al., 2021) and 2009–2019 (D’Amato et al., 2022). Our study is more comprehensive: it considers 6166 firms from 73 countries from 2005 to 2019 and uses six machine learning classification models to build a robust ESG rating prediction model.

Third, the insights gained from the variable importance analysis provide a more in-depth understanding of the fundamental drivers behind ESG ratings. This discovery is a valuable resource for policymakers, practitioners, and regulatory bodies. It allows them to develop well-informed policies and strategies to maintain and improve ESG ratings. Finally, our findings are substantiated by well-established theories including the stakeholder theory, agency theory, legitimacy theory, and signaling theory. This emphasizes the importance of factors specific to each firm and country in determining ESG performance. As a result, our findings not only add to existing knowledge, but also provide useful insights for advancing theoretical frameworks in ESG studies. Our findings pave the way to develop an ESG score prediction model, which may be utilized by policymakers, practitioners, investors, regulatory bodies, portfolio managers, and stakeholders to obtain up-to-date information for policy development and investment decision-making.

The paper continues as follows: Sect. 2 provides a literature review. Section 3 elaborates the study’s data and methodology. Section 4 presents the empirical results. Section 5 discusses the results in detail. Finally, Sect. 6 concludes this study with implications and future study directions.

2 Literature review

Corporate stakeholders have been consistently demanding assurance of “sustainability” throughout corporate operations (Goodell et al., 2021). This concern has forced corporations to adopt environmental, social and governance (ESG) practices as one of their core business functions (Crespi & Migliavacca, 2020). ESG practice, first introduced by the United Nations’ Principles of Responsible Investment (UNPRI) (Sharma et al., 2020), has been widely recognized as a standard to measure sustainable performance. The ‘E’ stands for environment-related components such as water use, renewable/non-renewable resources, and emissions. The ‘S’ includes societal and community-related elements, including health and safety issues, workplace diversity, child labor, and labor strikes. Management and board components, comprising board meetings, diversity, and board attendance, form the ‘G’ factor for governance. The significance of ESG information disclosure can be observed through the lens of corporations’ financial value. Such disclosure is used as a tool to fulfill multiple stakeholders’ expectations regarding business (Lokuwaduge & Heenetigala, 2017).

Early studies reveal that better ESG disclosure attracts both financial and non-financial stakeholders, resulting in enhanced firm financial structures (Henrique et al., 2019). Both types of stakeholder expect that firms maintain transparency in their company ESG information disclosure. Asset managers and investors, a type of financial stakeholder, increasingly integrate ESG materials into their investment decision-making (Friede et al., 2015). Since the ESG data are relevant to investment performance, financial stakeholders are paying more attention to firms with high ESG disclosure (Chauhan & Amit, 2014; Drempetic et al., 2020; Hörisch et al., 2015; Yu & Choi, 2016).

The literature also suggests that the determinants of ESG disclosure are derived from three closely related theories: the institutional, accountability, and legitimacy theories (Baldini et al., 2018; Weber, 2014). According to institutional theory, an organization resorts to ESG practice because of pressure from institutions such as non-government and independent organizations. Such institutional or broader social structures determine the survival of the organization. The accountability perspective holds that organizations report on ESG issues because of their obligations to stakeholders. Central to the legitimacy theory is the organization’s fulfillment of society’s needs and desires. Thus, an organization tries to secure or increase legitimacy by prioritizing society’s requirements. One way to ensure legitimacy is for the organization to undertake sustainability management activities. Previous studies use these theoretical stances to identify a list of country- and firm-level ESG disclosure determinants. The country-level drivers include the country’s political, labor, and cultural system, and the firm-level drivers comprise the firm’s visibility to both investors and non-investors (Baldini et al., 2018).

Another stream of studies examines the relationship between ESG and sustainability. Buallay (2019) examines the association between ESG and European bank performance and documents that ESG factors positively influence bank performance, suggesting that European banks are placing a greater focus on transparency, which in turn leads to improved financial performance. Using the Thomson Reuters ESG rating dataset, Drempetic et al. (2020) examine the impact of firm size on ESG rating and find a positive relationship between ESG and firm size. They conclude that larger corporations are more sustainable than smaller corporations, and that larger corporations have a greater advantage in terms of focusing on reporting transparency. Escrig-Olmedo et al. (2019) analyze the criteria of ESG ratings and how an ESG rating contributes to sustainable development. Their findings indicate that rating agencies should incorporate a broader range of sustainability indicators to ensure a comprehensive assessment of sustainable development.

Alsayegh et al. (2020) find that ESG disclosure enhances corporate sustainability performance. They find that transparent ESG disclosure, which delivers accurate information about a company, creates better prospects for building trust among stakeholders, ultimately resulting in enhanced company performance. Johnson (2020) investigates ESG data, its challenges, and proposed solutions for establishing a shared ESG language to effectively measure sustainable outcomes in global investments. Using the resource-based view and the stakeholder capitalism theory, Bhandari et al. (2022) find a concave relationship between ESG and competitive advantage. They contend that the attributes of a resource base for sustained competitive advantage currently recognized under the resource-based view overlook the crucial aspect of “ESG friendliness” in a resource. Garcia et al. (2019) examine the relationship between ESG rating and firm financials and find that firm market capitalization is a significant determinant of ESG performance, with larger companies generally exhibiting better performance. Additionally, firms in sensitive industries demonstrate strong environmental performance even after accounting for size and location.

Several mathematical tools and methods have been in use for prediction. Machine learning is becoming a widely used methodological tool to explain and forecast market trends in the financial industry (Henrique et al., 2019). It widely covers multiple facets of an algorithm for recognizing patterns and making decisions (Henrique et al., 2019). This tool, integrated with an artificial intelligence system, tries to extract patterns learned from historical data in a way regarded as training and learning. From traditional hedge fund managers to contemporary fintech service providers, all are extensively applying machine learning expertise (Holzinger et al., 2018). Today’s financial system, in other words, is supported by the generation of machine-readable data that act as a catalyst to disrupt and transform the financial industry.

A body of evidence exists regarding machine learning applications in the financial industry. Abdullah et al. (2023b) use a deep learning model with textual analysis for stock price forecasting. Barboza et al. (2017) combine traditional analytical tools (logistic and discriminant analysis) with machine learning to analyze credit risk and bankruptcy issues. Ciampi et al. (2021) review SME-default prediction models. Abdullah et al. (2023a) apply machine learning models to forecast bank non-performing loans and find superior performance for the random forest model. This list implies persistent growth in the application of machine learning tools and techniques in financial firms to explore multiple, diverse facets of the financial system.

A growing body of literature also applies machine learning in ESG studies. Raman et al. (2020) use machine learning algorithms to examine earnings call transcripts and document that ESG factors are vital for corporate policy. Unlike other studies, D’Amato et al. (2022) forecast ESG scores using data on 401 firms from the STOXX Europe 600 index. They find that the random forest model is better at forecasting the ESG score, indicating that the random forest model has a better capacity to capture the nonlinearity of the ESG model. Sokolov et al. (2021) develop an ESG rating model using deep learning that can predict the ESG rating from unstructured text. Lanza et al. (2020) propose a machine learning model to examine contradictions in ESG scores. They find that information extracted from ESG factors significantly enhances the portfolio optimization model by providing better information about the companies’ capacity to handle climate change risk, specifically transition risk.

De Lucia et al. (2020) develop a European firm financial performance prediction model using ESG indicators and macro-economic factors and document a positive relationship with firm financial performance and ESG practices. They argued that company-specific conditions establish a structural framework that connects practices with performance, and this correlation is influenced by a company’s growth opportunities, which in turn are heightened by higher levels of transparency and reporting. Antoncic (2020) finds that big data analytics can avoid bias in ESG ratings and can eliminate greenwashing. Recently, Lee and Kim (2023) developed an ESG text classifier to extract ESG related information from Korean sustainability reports. Their proposed natural language processing model can achieve 86.66% accuracy.

Overall, the above literature review suggests that studies are mostly based on stakeholder, agency, legitimacy, and signaling theories. There is a growing body of literature that uses machine learning and other big data analytics approaches in ESG-related studies. However, studies have yet to examine the application of machine learning to ESG rating prediction using data from multiple countries.

3 Data and methodology

We aim to develop a machine learning-based ESG rating prediction model that policymakers and stakeholders can apply using a combination of firm-specific and macroeconomic predictors. The paper’s framework is presented in Fig. 1. It elaborates the feature selection, data collection, data cleaning, feature elimination, model selection, and cross-validation processes.

Fig. 1 Classification framework of the machine learning models used in this study

3.1 Data and pre-processing

Before data collection, the authors conducted a rigorous review of the literature to determine the relevant ESG score predictors. The selection of variables is motivated by the theoretical foundation of the stakeholder, agency, legitimacy, and signaling theories. Predictors are then categorized into firm-specific and macroeconomic predictors of the ESG rating. Table 1 presents the selected predictors, with the categorical ESG rating as the main outcome variable. Refinitiv captures and calculates over 500 business-level ESG measures for ESG score calculation. It shortlists 186 of the most comparable metrics per industry that drive a company’s overall appraisal and scoring process. Refinitiv offers a company rating that ranges from A+ to D− and is divided into 12 categories. However, we collapse this standard rating system into four rating categories: A, B, C, and D. The descriptions are presented in Table S.1. A total of 12 predictors are selected: eight in the firm-specific category and four in the macroeconomic category.
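The collapse of the 12-grade scale into four classes can be sketched in a few lines of Python. This is only an illustration: the mapping used here (dropping the plus or minus modifier from each grade) is an assumption, the function name `collapse_rating` is hypothetical, and the authors’ actual category definitions are those given in Table S.1.

```python
# Illustrative sketch only: assumes each grade collapses to its letter.
# The authors' actual mapping is documented in their Table S.1.
GRADES_12 = ["A+", "A", "A-", "B+", "B", "B-",
             "C+", "C", "C-", "D+", "D", "D-"]


def collapse_rating(grade: str) -> str:
    """Map a 12-category grade such as 'B+' to one of 'A', 'B', 'C', 'D'."""
    # Normalise the typographic minus sign (U+2212) to an ASCII hyphen.
    cleaned = grade.strip().replace("\u2212", "-")
    if cleaned not in GRADES_12:
        raise ValueError(f"unknown rating: {grade!r}")
    return cleaned[0]


if __name__ == "__main__":
    for g in GRADES_12:
        print(g, "->", collapse_rating(g))
```

Under this assumed mapping, every three adjacent grades share one class, so the outcome variable has exactly four levels.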

Table 1 Specification of the variables used in the models

Data on ESG ratings and firm-specific predictors were collected from the Thomson Reuters DataStream database. The most recent Thomson Reuters ESG rating report covers 7096 firms. Fifteen fiscal years of data, from 2005 to 2019, were collected to cover the periods before and after the financial crisis. First, observations with missing values are excluded from the dataset, which produces an unbalanced panel dataset of 45,175 observations from 6166 firms (see details in Table S.2). Outliers were winsorized (see footnote 1) to produce the final shape of the data shown in Fig. 2. The Recursive Partitioning and Regression Trees (RPART) algorithm was applied for feature elimination by calculating the variable importance factor (VarImp; see footnote 2). VarImp results are presented in Table 1, and variables with low VarImp values are removed from the dataset. Accordingly, ROA, ROE, NPM, and INF are removed. The dataset is then split into training and testing datasets, with 70% of the data (2005–2014) used for training and the remaining 30% (2015–2019) used for out-of-sample testing.
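The winsorizing and time-based split described above can be sketched as follows. This is a minimal Python illustration, not the authors’ R code: the quantile cut-offs, the nearest-rank quantile rule, and the names `winsorize` and `split_by_year` are all assumptions made for the example.

```python
def winsorize(values, lower=0.01, upper=0.99):
    """Clip values to empirical quantiles (cut-offs here are assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple nearest-rank style quantiles; other conventions exist.
    low = ordered[int(lower * (n - 1))]
    high = ordered[int(upper * (n - 1))]
    return [min(max(v, low), high) for v in values]


def split_by_year(rows, last_train_year=2014):
    """Chronological split: years up to last_train_year train, later years test."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


if __name__ == "__main__":
    clipped = winsorize(list(range(1, 101)), lower=0.05, upper=0.95)
    print(min(clipped), max(clipped))
    panel = [{"year": y} for y in range(2005, 2020)]
    train, test = split_by_year(panel)
    print(len(train), "training years,", len(test), "testing years")
```

Splitting by year rather than at random keeps the test set strictly out of sample in time, which matches the 2005–2014 and 2015–2019 windows stated in the text.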

Fig. 2 Distribution of environment, society and governance scores by continent, year and industry

3.2 Methodology and hyperparameter optimization

Our methodology involves multiclass classification, which is different from binary classification and cannot be modeled using conventional methods (Abdullah, 2021; Barbeito-Caamaño & Chalmeta, 2020; Jiang et al., 2022; Kumar et al., 2022; Wang et al., 2022). Therefore, we use multiclass classification machine learning models. There is a wide range of machine learning models available for classification (Kotsiantis et al., 2006). Following the literature (Abdullah et al., 2023a; Barboza et al., 2017; D’Amato et al., 2022), we chose the Artificial Neural Networks Classifier, Bagging Classifier, k-Nearest Neighbors Classifier, Naive Bayes Classifier, Random Forest Classifier, and Support Vector Machines Classifier.

These selected models have proven to be superior in several financial classification problems, e.g., bankruptcy prediction, credit rating, and credit default prediction (Abdullah et al., 2023a; Barboza et al., 2017; D’Amato et al., 2022). The motivation behind choosing multiple models is to select the best model for ESG rating prediction from these widely used models. We used the caret package in the R statistical software for machine learning classification model training and testing (see footnote 3). Selected classifiers and their base learners are listed in Table 2. Hyperparameters (see footnote 4) are optimized by ten-fold cross-validation with five repetitions. Other tuning parameters and model details are discussed as follows:

Table 2 Selected machine learning classifiers and base learners
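The resampling scheme used for hyperparameter optimization (cross-validation with ten folds, repeated five times) can be sketched in Python as below. This is not the caret implementation; the shuffling rule, the seed, and the function name `repeated_kfold` are assumptions for illustration.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (train_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            yield train, held_out


if __name__ == "__main__":
    splits = list(repeated_kfold(50))
    print(len(splits), "train/validation splits")
```

Each candidate hyperparameter setting would be scored on all 50 held-out folds and the averages compared, which is the usual way a grid search picks the final values.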

3.2.1 Artificial neural networks classifier (ANNC)

Artificial neural networks are simple networks of artificial neurons modeled on the neural structure of the human brain. They process observations one at a time, learning by comparing their predicted classification with the actual classification of each observation (May et al., 2010). The classification errors from the first pass are fed back into the network and used to adjust its weights in subsequent iterations. Neurons are organized into three layers: input, hidden, and output. The input layer contains a small number of neurons that receive the inputs and pass them to the hidden layer. Finally, observations reach the output layer, which assigns their classification. Equations (1) and (2) illustrate the ANNC, where \(X\) is the input matrix and \(W\) is the weight matrix. Moreover, \({z}_{k}\) defines the output function, where \(nH\) denotes the number of perceptrons in the hidden layer and \(w_{0}\) denotes the bias term. For this study, the neural network is taken as the base learner, with size = 10 and decay = 0.01 selected as hyperparameters after a grid search.

$$ f\left( X \right) = \left\{ {\begin{array}{*{20}l} 1 &\qquad {W^{T} X + w_{0} > 0} \\ 0 &\qquad {W^{T} X + w_{0} \le 0} \\ \end{array} } \right. $$
(1)
$$ z_{k} = f\left( {\mathop \sum \limits_{j = 1}^{nH} w_{kj} f\left( {\mathop \sum \limits_{i = 1}^{d} x_{i} w_{ji} + w_{j0} } \right) + w_{k0} } \right) $$
(2)
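A forward pass through Eqs. (1) and (2) can be written out directly. The Python sketch below uses the threshold activation of Eq. (1) in both layers; the weights are hand-picked for illustration (they are not fitted values from the study), and trained networks normally use smooth activations instead.

```python
def step(a):
    """Threshold activation of Eq. (1): 1 if the weighted input is positive."""
    return 1 if a > 0 else 0


def forward(x, hidden_w, hidden_b, out_w, out_b, f=step):
    """Eq. (2): z_k = f(sum_j w_kj * f(sum_i x_i * w_ji + w_j0) + w_k0)."""
    hidden = [
        f(sum(xi * wji for xi, wji in zip(x, w_j)) + w_j0)
        for w_j, w_j0 in zip(hidden_w, hidden_b)
    ]
    return [
        f(sum(hj * wkj for hj, wkj in zip(hidden, w_k)) + w_k0)
        for w_k, w_k0 in zip(out_w, out_b)
    ]


if __name__ == "__main__":
    # Hand-picked weights: two hidden units (OR, AND) combine into XOR.
    hidden_w, hidden_b = [[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5]
    out_w, out_b = [[1.0, -1.0]], [-0.5]
    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, forward(x, hidden_w, hidden_b, out_w, out_b))
```

The example shows why the hidden layer matters: no single threshold unit can separate the XOR pattern, but two hidden units feeding one output unit can.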

3.2.2 Bagging classifier (BGC)

A bagging classifier is an ensemble algorithm that fits base classifiers to random subsets of the main dataset and then combines their individual predictions into a final prediction. This type of algorithm reduces error by introducing randomness into the inputs used for prediction. The base classifiers are trained in parallel, each on a different sample drawn from the training dataset (Breiman, 1996). Equation (3) presents the construct of the bagging classifier, where \(\varphi (x,L)\) denotes the predictor trained on the learning set \(L\), and \(j\) denotes the class. For this study, Bagged CART is selected as the base learner of the Bagging Classifier; this learner has no tuning hyperparameter.

$$ Q(j|x) = P\left( {\varphi \left( {x,L} \right) = j} \right) $$
(3)
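The quantity \(Q(j|x)\) in Eq. (3) can be approximated by the share of bootstrap-trained predictors that vote for class \(j\). The Python sketch below illustrates this with a deliberately simple base learner (one-nearest-neighbor on a single feature) in place of the CART trees used in the study; the toy data and all function names are assumptions for illustration.

```python
import random
from collections import Counter


def fit_1nn(sample):
    """Toy base learner: predict the label of the closest training point."""
    def predict(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return predict


def bagged_vote_shares(data, x, n_models=25, seed=0):
    """Estimate Q(j|x) as the fraction of bootstrap models predicting j."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap replicate: draw len(data) rows with replacement.
        replicate = [rng.choice(data) for _ in data]
        votes[fit_1nn(replicate)(x)] += 1
    return {label: count / n_models for label, count in votes.items()}


if __name__ == "__main__":
    toy = [(0.0, "D"), (0.1, "D"), (0.2, "D"),
           (5.0, "A"), (5.1, "A"), (5.2, "A")]
    shares = bagged_vote_shares(toy, x=0.05)
    print(shares, "->", max(shares, key=shares.get))
```

The final bagged prediction is the class with the largest vote share.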

3.2.3 k-Nearest neighbors classifier (KNNC)

K-nearest neighbors is one of the most basic yet important classification algorithms in machine learning. KNNC is a straightforward non-parametric technique used for classification and regression. In the k-nearest neighbors algorithm, an observation is assigned to a class according to the classes of its k nearest neighbors in the feature space. Keller et al. (1985) proposed a fuzzy variant of KNNC and named it the “fuzzy k-nearest neighbors classifier algorithm”. Under this method, each observation receives a degree of membership in every class. The formula for this method is:

$$ u_{i} = \frac{{\mathop \sum \nolimits_{j = 1}^{k} u_{ij} \left( {1/\left\| {x - x_{j} } \right\|^{{2/\left( {m - 1} \right)}} } \right)}}{{\mathop \sum \nolimits_{j = 1}^{k} \left( {1/\left\| {x - x_{j} } \right\|^{{2/\left( {m - 1} \right)}} } \right)}} $$
(4)

Here \(i = 1, 2, \ldots, n\) and \(j = 1, 2, \ldots, k\), where \(n\) is the number of classes and \(k\) is the number of nearest neighbors. The fuzzy strength parameter \(m\) determines how heavily distance is weighted when calculating each neighbor’s contribution to the membership value. The term \({u}_{ij}\) is the membership degree in class \(i\) of \({x}_{j}\), the \(j\)-th of the \(k\) nearest neighbors of \(x\) in the training set. The following tuning parameters are used for final model training after the grid search: distance = 2, kmax = 9, and kernel = optimal.
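Equation (4) can be evaluated directly, as in the Python sketch below. It assumes Euclidean distance and crisp (0 or 1) memberships for the training points; the toy data, the small constant `eps` that guards against division by zero, and the function name are illustrative choices, not part of the study.

```python
import math


def fuzzy_knn_membership(x, training, k=3, m=2.0, eps=1e-12):
    """Eq. (4): distance-weighted class memberships of a query point x.

    training is a list of (features, {class_label: membership u_ij}).
    """
    nearest = sorted(training, key=lambda row: math.dist(x, row[0]))[:k]
    # Weight 1 / ||x - x_j||^(2 / (m - 1)) for each of the k neighbours.
    weights = [
        1.0 / max(math.dist(x, feats), eps) ** (2.0 / (m - 1.0))
        for feats, _ in nearest
    ]
    total = sum(weights)
    classes = {label for _, u in nearest for label in u}
    return {
        label: sum(w * u.get(label, 0.0)
                   for w, (_, u) in zip(weights, nearest)) / total
        for label in classes
    }


if __name__ == "__main__":
    toy = [((0.0, 0.0), {"A": 1.0}), ((0.0, 1.0), {"A": 1.0}),
           ((5.0, 5.0), {"B": 1.0}), ((5.0, 6.0), {"B": 1.0})]
    print(fuzzy_knn_membership((0.0, 0.5), toy, k=3, m=2.0))
```

With \(m = 2\) the weights reduce to inverse squared distances, so nearby neighbors dominate the membership values.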

3.2.4 Naive Bayes classifier (NBC)

The Naive Bayes Classifier is a probabilistic classifier that applies Bayes’ theorem to the input variables to produce class predictions. The architecture of NBC consists of nodes and edges, wherein the nodes symbolize the inputs, and the edges connect the nodes leading to the output. This configuration forms a directed acyclic graph (Pearl, 1988). A set of conditional probabilities serves as the probabilistic input to the NBC model. For example, a company credit rating model is trained on previously rated companies, and NBC outputs the rating by estimating the likelihood of each prediction class.

$$ P(A | B) = P(B | A) P\left( A \right)/ P\left( B \right) $$
(5)

Equation (5) states Bayes’ theorem. Here, \(P(A| B)\) is the posterior probability of class \(A\) given the observed predictors \(B\). \(P(B | A)\) represents the likelihood, \(P(A)\) is the prior class probability and, finally, \(P (B)\) represents the marginal probability of the predictors. The Naive Bayes Classifier model assumes the following condition of independence:

$$ P_{i} \bot \left\{ {P_{1} ,P_{2} , \ldots ,P_{i - 1} ,P_{i + 1} , \ldots ,P_{n} } \right\}| V $$
(6)

where \(i = 1, 2, \ldots , n\). This condition states that the predictors \(P_{1} ,P_{2} , \ldots ,P_{n}\) are conditionally independent of one another given \(V\) (Sun & Shenoy, 2007). The parameters fL = 0, usekernel = TRUE, and adjust = 1 are selected after the grid search.
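To make Eqs. (5) and (6) concrete, the Python sketch below computes class posteriors for categorical predictors by multiplying the class prior with one conditional probability per predictor. It is a simplified illustration: it uses Laplace-smoothed frequency counts rather than the kernel density estimates implied by usekernel = TRUE, and the toy data and function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, laplace=1.0):
    """Count class frequencies and per-class predictor value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # keyed by (predictor index, class)
    levels = defaultdict(set)            # distinct values seen per predictor
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            levels[i].add(value)
    return class_counts, value_counts, levels, laplace


def posterior(model, row):
    """Eq. (5) under the independence condition of Eq. (6)."""
    class_counts, value_counts, levels, laplace = model
    n = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / n)  # prior P(A)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + laplace
            denominator = count + laplace * len(levels[i])
            log_p += math.log(numerator / denominator)  # one factor of P(B|A)
        log_scores[label] = log_p
    top = max(log_scores.values())
    unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
    evidence = sum(unnormalised.values())  # plays the role of P(B)
    return {k: v / evidence for k, v in unnormalised.items()}


if __name__ == "__main__":
    rows = [("large", "high"), ("large", "high"),
            ("small", "low"), ("small", "low")]
    labels = ["A", "A", "D", "D"]
    print(posterior(fit_naive_bayes(rows, labels), ("large", "high")))
```

Working in log space avoids numerical underflow when many predictors are multiplied together.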

3.2.5 Random forest classifier (RFC)

The random forest classifier is built from many distinct decision trees that work together as an ensemble. Each tree in the random forest produces its own class prediction, and the class with the most votes becomes the prediction (Ho, 1995). The importance of each node is computed as:

$$ ni_{j} = w_{j} C_{j} - w_{L\left( j \right)} C_{L\left( j \right)} - w_{R\left( j \right)} C_{R\left( j \right)} $$
(7)

Here, \(j\) denotes each node of the tree, \(ni\) is the importance function, \(w_{j}\) denotes the weight of node \(j\), \(C_{j}\) denotes the node impurity, and \(L\left( j \right)\) and \(R\left( j \right)\) denote the left and right child nodes of the split at node \(j\). In this study, mtry = 2 and ntree = 550 are used as the final hyperparameters.
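Equation (7) is a weighted impurity decrease, which the Python sketch below evaluates for a single split. Two choices here are assumptions made for illustration: the impurity \(C\) is taken to be the Gini index, and the weight \(w\) is taken to be the share of all samples that reach the node.

```python
def gini(class_counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def node_importance(parent, left, right, n_total):
    """Eq. (7): weighted impurity of a node minus that of its two children."""
    def weight(counts):
        return sum(counts) / n_total
    return (weight(parent) * gini(parent)
            - weight(left) * gini(left)
            - weight(right) * gini(right))


if __name__ == "__main__":
    # A split that separates two classes perfectly versus one that does nothing.
    print(node_importance([50, 50], [50, 0], [0, 50], n_total=100))
    print(node_importance([50, 50], [25, 25], [25, 25], n_total=100))
```

Summing these node values over every split that uses a given predictor, and averaging across trees, gives one common form of variable importance.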

3.2.6 Support vector machines classifier (SVMC)

This classifier is a hybrid statistical learning model. Cortes and Vapnik (1995) introduced support vector machine models. They defined the SVM as mapping the original data into a high-dimensional space in which the prediction classes are separated. Support vector machines are also straightforward to analyze statistically. SVMC aims to identify a hyperplane in this high-dimensional sample space that accurately classifies the data points. Hence, the support vector machine is constructed as:

$$ \mathop {\min }\limits_{w,b,\xi } \; \frac{1}{2}w^{T} w + c\mathop \sum \limits_{i = 1}^{N} \xi_{i} $$
(8)
$$ y_{i} \left[ {w^{T} \varphi \left( {x_{i} } \right) + b} \right] \ge 1 - \xi_{i} $$
(9)

where \(i = 1,2,3, \ldots ,N\); \(\xi_{i} \ge 0\) are the classification errors (slack variables); \(c\) is the cost associated with these errors; \(y_{i}\) are the outcome variables; and \(\varphi \left( x \right)\) is the transformation into the feature space. We apply the SVMC algorithm as a base learner, with C = 1 selected after the grid search.
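The role of the slack terms in Eqs. (8) and (9) can be illustrated numerically. The Python sketch below evaluates the objective for a given hyperplane in the binary case with labels +1 and −1, taking \(\varphi\) to be the identity map (a linear kernel); both are simplifying assumptions, and the weights and data are made up for the example. Multiclass problems are handled by combining several such binary problems.

```python
def slack(w, b, x, y):
    """Smallest slack xi >= 0 that satisfies Eq. (9) for one observation."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, X, Y, c=1.0):
    """Eq. (8): 0.5 * w'w + c * (sum of slacks), phi taken as the identity."""
    regulariser = 0.5 * sum(wi * wi for wi in w)
    return regulariser + c * sum(slack(w, b, x, y) for x, y in zip(X, Y))


if __name__ == "__main__":
    X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
    Y = [1, -1, 1]
    w, b = [1.0, 0.0], 0.0
    print([slack(w, b, x, y) for x, y in zip(X, Y)])
    print(soft_margin_objective(w, b, X, Y, c=1.0))
```

Points beyond the margin contribute zero slack, while the third point sits inside the margin and is penalized in proportion to the cost parameter.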

3.3 Performance metrics

Models are compared using different performance metrics after model training and testing. We use logLoss, AUC, prAUC, Accuracy, Kappa, F1, Sensitivity, and Specificity to measure model performance. Sensitivity measures a model’s power to predict true positive rates through all classes and specificity measures a model’s power to predict true negative rates across all classes. Equations (10) and (11) depict the sensitivity and specificity:

$$ Sensitivity = NTP/\left( {NTP + NFN} \right) $$
(10)
$$ Specificity = NTN/\left( {NTN + NFP} \right) $$
(11)

where NTP is the number of correctly classified true positives and NTN is the number of correctly classified true negatives. NFN in Eq. (10) is the number of false-negative classifications, and NFP in Eq. (11) is the number of false-positive classifications. If the classification error is low, the sensitivity and specificity values will be close to one.
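For a multiclass problem, sensitivity and specificity are usually computed per class by treating that class as positive and all other classes as negative. The Python sketch below does this from a confusion matrix, using the standard definitions NTP/(NTP + NFN) for sensitivity and NTN/(NTN + NFP) for specificity; the counts in the example are made up and are not results from this study.

```python
def one_vs_rest_rates(matrix, labels):
    """Per-class sensitivity and specificity from a confusion matrix.

    matrix[i][j] counts observations of actual class labels[i]
    that were predicted as labels[j].
    """
    total = sum(sum(row) for row in matrix)
    rates = {}
    for k, label in enumerate(labels):
        ntp = matrix[k][k]
        nfn = sum(matrix[k]) - ntp                 # actual k, predicted other
        nfp = sum(row[k] for row in matrix) - ntp  # actual other, predicted k
        ntn = total - ntp - nfn - nfp
        rates[label] = {
            "sensitivity": ntp / (ntp + nfn),
            "specificity": ntn / (ntn + nfp),
        }
    return rates


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]
    for label, r in one_vs_rest_rates(toy, ["A", "B"]).items():
        print(label, r)
```

Averaging the per-class values gives a single summary figure for each metric.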

The efficiency of a machine learning classification model with a binary or multiclass prediction class is measured by logarithmic loss (logLoss). The primary objective of a machine learning model is to reduce the logLoss value as much as possible; the closer the logLoss is to 0, the closer the model is to perfect. The area under the receiver operating characteristic (ROC) curve is represented by AUC. The ROC curve plots a classifier’s true positive rate against its false positive rate across discrimination thresholds. A relatively high AUC value indicates that the model has high predictive power. The precision-recall AUC (prAUC) is the area under the precision-recall curve and summarizes how well the model identifies the positive class. Cohen’s Kappa statistic, which indicates the degree of model fit, is an enhanced indicator for multiclass predictive modeling. According to Landis and Koch (1977), the value of Kappa can be classified into six categories: (i) below 0 percent suggests that the model is not a good fit; (ii) 0 percent to 20 percent indicates low significance; (iii) 21 percent to 40 percent indicates fair significance; (iv) 41 percent to 60 percent indicates moderate significance; (v) 61 percent to 80 percent suggests considerable significance; and (vi) 81 percent to 100 percent indicates very high significance.
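Two of these metrics are simple enough to compute by hand. The Python sketch below implements multiclass logLoss and Cohen’s Kappa from their standard definitions; the clipping constant `eps`, the function names, and the example counts are illustrative assumptions rather than values from this study.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        p = min(max(probs[label], eps), 1.0 - eps)  # avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def cohen_kappa(matrix):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(cohen_kappa([[8, 2], [1, 9]]))
    uniform = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]
    print(log_loss(["A", "B"], uniform))
```

Kappa corrects raw accuracy for the agreement expected by chance, which matters when the rating classes are unbalanced.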

4 Empirical results

4.1 Exploratory data analysis results

First, descriptive statistics are used to confirm the validity of the dataset. The results are presented in Table 3. Figure 2 plots the yearly ESG scores by sector and continent. It indicates that most of the companies rated by Thomson Reuters are from North America. There are fewer observations from the start of the sample period (2005–2007). Also, the number of A category companies is very low compared with the other rating categories. Nevertheless, the descriptive statistics in Table 3 indicate low variation among the predictors across the full, training, and testing datasets. However, there is high variation in LESG and TIE in all datasets. Except for SIZE and GDP, the predictors are positively skewed.

Pearson correlation analysis is used to examine the correlations among the predictors; the results are presented in Table 3. The main outcome variable of the study, ESG rating, is categorical and is therefore excluded from the correlation analysis. However, we consider the lagged ESG score (LESG), which captures possible correlations among the variables. The results indicate that all variables have a statistically significant correlation with LESG. All variables are positively correlated with LESG except GDP, TIE, and GDPG, which are negatively correlated. These statistical indicators support the validity of the training and testing datasets. Thus, we proceed with further analysis by developing six machine learning models.

Table 3 Descriptive statistics and correlation analysis of the firm sample and sub-samples

4.2 Model training and testing results

After confirming the validity of the dataset, we trained and tested six machine learning models. Figure 3 illustrates the confusion matrices of all models. The ANNC confusion matrix shows little misclassification: (i) for ESG rating A, there are 743 actual observations in the testing dataset and the model predicted 816; (ii) for B, there are 3409 actual observations, whereas the model predicted 3340; (iii) for C, there are 5676 actual observations and the model predicted 5638; and (iv) for D, there are 3725 actual observations and the model predicted 3759. Thus, these confusion matrix results indicate low prediction error. The other models produce similar confusion matrices. However, it is hard to determine from confusion matrices alone which model is best for ESG rating prediction. Therefore, we analyze the performance metrics of each model.
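The actual and predicted counts quoted above correspond to the row and column totals of a confusion matrix. The following minimal sketch, using hypothetical counts rather than the study’s Fig. 3 values, shows how these totals and the overall accuracy are obtained, and why close totals alone do not settle which model is best.

```python
LABELS = ["A", "B", "C", "D"]

# Hypothetical counts for illustration only.
# Rows are actual ratings, columns are predicted ratings.
matrix = [
    [30,  8,  2,  0],
    [10, 60, 12,  3],
    [ 1, 15, 90, 10],
    [ 0,  2,  9, 48],
]

def actual_totals(m):
    """Number of observations that truly belong to each class (row sums)."""
    return [sum(row) for row in m]

def predicted_totals(m):
    """Number of observations assigned to each class by the model (column sums)."""
    return [sum(row[j] for row in m) for j in range(len(m))]

def accuracy(m):
    """Share of observations on the diagonal, i.e. correctly classified."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(sum(row) for row in m)

for label, actual, predicted in zip(LABELS, actual_totals(matrix), predicted_totals(matrix)):
    print(label, "actual:", actual, "predicted:", predicted)
print("accuracy:", accuracy(matrix))
```

In this toy matrix the actual and predicted totals differ by at most three observations per class, yet only the diagonal cells are correct classifications, which is why accuracy and the other metrics are examined next.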

Fig. 3 Confusion matrix of all machine learning models

The training and testing performance metrics of all machine learning models are presented in Table 4. In the training phase, the random forest classifier (RFC) has the highest accuracy (78.3%). Its testing accuracy is slightly higher, at 78.5%. The rest of the models produce similar results, so other performance metrics should be considered before selecting RFC as the best model. The logLoss value of the bagging and k-nearest neighbor classifier models is over 1, indicating a higher error rate. The other models have lower logLoss values, and RFC has the lowest logLoss value in both the training and testing phases (Abdullah, 2021; Bishop, 1995; Ho, 1995).

Table 4 Training and testing results of seven different machine learning models

The area under the curve (AUC) also supports RFC as the best model: its AUC value of 94.1% is the highest among all models. The ROC curves of all models are plotted in Fig. 4. The curves lie close to each other, showing that all models have similar areas under the curve. The testing phase provides results comparable with the training phase. The Kappa value of the RFC model is the highest, at 68.2%, which indicates considerable significance in the training and testing phases (Abdullah, 2021; Ciampi et al., 2021). Sensitivity and specificity values are near one, indicating a low misclassification rate. The F1 scores of all models are near one, indicating few false-positive and false-negative errors in the predictions. Interestingly, the artificial neural network model had the shortest training time (119.162 min), and the random forest had the longest (221.669 min). This outcome is expected because this study uses multiclass classification, which leads to more tree splits in the random forest algorithm (D’Amato et al., 2022; Krauss et al., 2017; Priya et al., 2018).
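To make the class-level metrics concrete, the sketch below derives sensitivity, specificity, and F1 for one rating class from a confusion matrix by treating that class as positive and all others as negative (a one-vs-rest view). The counts are hypothetical, and the one-vs-rest treatment is an assumption about how the multiclass metrics are aggregated rather than a description of this study’s software.

```python
# Hypothetical counts for illustration only.
# Rows are actual ratings, columns are predicted ratings, ordered A, B, C, D.
matrix = [
    [30,  8,  2,  0],
    [10, 60, 12,  3],
    [ 1, 15, 90, 10],
    [ 0,  2,  9, 48],
]

def one_vs_rest_metrics(m, k):
    """Sensitivity, specificity, and F1 for class index k against all other classes.

    Assumes class k occurs at least once among both actual and predicted labels.
    """
    total = sum(sum(row) for row in m)
    tp = m[k][k]                          # class k correctly predicted
    fn = sum(m[k]) - tp                   # class k predicted as something else
    fp = sum(row[k] for row in m) - tp    # other classes predicted as k
    tn = total - tp - fn - fp             # everything else
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

metrics_a = one_vs_rest_metrics(matrix, 0)
print({name: round(value, 3) for name, value in metrics_a.items()})
```

Because F1 is the harmonic mean of precision and sensitivity, it can only be high when sensitivity is also high.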

Fig. 4 The receiver operating characteristic curve of all machine learning models

Overall, the results indicate that the random forest classifier (RFC) outperforms all other machine learning models in ESG rating prediction. The results are robust across the training and testing phases of model development. Considering all the performance metrics (logLoss, AUC, prAUC, Accuracy, Kappa, F1, Sensitivity, Specificity, and the ROC curve), the random forest classifier is the superior machine learning approach for ESG rating prediction. Many previous studies also find that random forest algorithms achieve higher accuracy in other financial applications (D’Amato et al., 2022; Krauss et al., 2017; Priya et al., 2018).

4.3 Variable importance results

Because the random forest classifier outperforms all other models, Table 5 presents its variable importance results to examine how much each variable contributes to the model (Archer & Kimes, 2008). The results indicate that the lagged ESG score is the top contributing variable, firm size is the second-highest, and financial leverage is the third-highest. Yu and Choi (2016) also find that firm size and financial leverage are important factors in ESG ratings. Macroeconomic variables are the lowest contributors to the model.

Table 5 The importance of the variable in the random forest classifier model

4.4 Robustness test results

To examine the robustness of our model, we used different training–testing splits. Our initial analysis was based on a 70% (train) and 30% (test) sample division. Following the approach of Abdullah et al. (2023a), we also used 50:50 and 90:10 train-test splits to assess both the stability and performance of our models. Table 6 presents the results of the robustness test using these splits. Our findings indicate that the RFC consistently achieves the greatest accuracy among all models. Other performance metrics, such as AUC and Kappa, also confirm the superior performance of the RFC.
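The three split ratios can be illustrated with a short sketch. It assumes a stratified random split, in which each rating class is divided in the same proportion; the source does not state how the splits were drawn, and the function name, seed, and toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices), splitting each class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy rating labels for illustration only (class A is deliberately rare).
labels = ["A"] * 10 + ["B"] * 40 + ["C"] * 60 + ["D"] * 40

for fraction in (0.7, 0.5, 0.9):
    train_idx, test_idx = stratified_split(labels, fraction)
    print(f"{fraction:.0%} train:", len(train_idx), "train /", len(test_idx), "test")
```

Stratifying keeps a rare class such as rating A represented in both subsets at every ratio, which matters most for the 90:10 split, where the test set is small.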

Table 6 Robustness across different training and testing splits of the machine learning models

These results further reinforce our initial findings based on the 70:30 train-test split, highlighting the RFC as the superior machine learning approach for predicting ESG scores. These outcomes are consistent with earlier studies demonstrating that the RFC model outperforms others in various financial modelling contexts (Abdullah et al., 2023a; D’Amato et al., 2022; Krauss et al., 2017; Priya et al., 2018). A comparison with related studies highlights the best-performing model in each (see Table 7). The table shows that the RFC model achieves the leading accuracy both in ESG analysis and across broader fields of finance-related research. The RFC’s performance can be attributed to the nature of its algorithm. By aggregating the votes of many decision trees, each grown on a bootstrap sample with a random subset of predictors considered at each split, the algorithm handles multi-class classification tasks well, such as our task of predicting ESG ratings.

Table 7 Machine learning model performance in related studies

We then performed a parallelized evaluation over our initial grid of hyperparameters to diagnose the performance of the RFC model. The grid search covered “mtry” (the number of predictors sampled as split candidates at each node) from 1 to 8 and “ntree” (the number of trees in the forest) from 100 to 600, applied to our baseline sample (70:30 split). Figure 5 illustrates the RFC’s accuracy results. Accuracy tends to decrease as mtry increases, which suggests a trade-off between accuracy and the complexity introduced by higher mtry values. The effect of ntree is not uniform: accuracy both rises and falls with ntree, depending on the mtry value. The peak accuracy, 78.5%, was achieved when mtry was set to 2 and ntree was set to 550. This configuration optimizes the model’s accuracy while accounting for the number of predictors and decision trees used.
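The structure of such a parallelized grid search can be sketched as follows. The ntree step of 50 is an assumption (the source reports only the range), and `toy_score` is a placeholder that returns an arbitrary number; in practice it would fit a random forest with the given settings and return its test accuracy. Nothing below reproduces the study’s results.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

MTRY_VALUES = range(1, 9)           # 1 to 8 predictors tried at each split
NTREE_VALUES = range(100, 601, 50)  # 100 to 600 trees; the step of 50 is assumed

def toy_score(params):
    """Placeholder for 'fit a forest with (mtry, ntree) and return test accuracy'.

    The formula is arbitrary and exists only to make the sketch runnable.
    """
    mtry, ntree = params
    return 1.0 - 0.01 * abs(mtry - 3) - 0.0001 * abs(ntree - 300)

def grid_search(score_fn, max_workers=4):
    """Score every (mtry, ntree) pair in parallel and return the best pair."""
    grid = list(product(MTRY_VALUES, NTREE_VALUES))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, grid))  # results keep the grid order
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index], dict(zip(grid, scores))

best_params, best_score, all_scores = grid_search(toy_score)
print("grid size:", len(all_scores))
print("best (mtry, ntree):", best_params)
```

Each grid point is independent of the others, so the evaluations can run concurrently. For CPU-bound model fitting, a process pool or the parallel backend of the modelling library would be the usual choice over threads.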

Fig. 5 Random Forest Classifier model: Accuracy across grid search

Figure 6 presents the Kappa results from the RFC. The findings again show that the highest Kappa value occurs when mtry is set to 2 and ntree is set to 550. The Kappa statistic is a measure of agreement that accounts for the possibility of agreement occurring by chance. In our context, a high Kappa value indicates a high level of agreement between predicted and actual ESG ratings at these mtry and ntree values.

Fig. 6 Random Forest Classifier model: Kappa across grid search

Figure 7 illustrates the AUC results from the RFC. Again, the highest AUC value occurs when mtry is set to 2 and ntree is set to 550. In our context, the AUC metric indicates the model’s ability to discriminate between classes, that is, its effectiveness in distinguishing the various ESG ratings. A higher AUC value reflects the RFC’s robust discriminatory power, especially when mtry and ntree are set to 2 and 550, respectively. Overall, the above results reinforce the RFC’s efficacy in accurately predicting ESG ratings under these specific parameter conditions.
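The interpretation of AUC as discriminatory power can be made concrete: for one rating class treated as positive against all others, AUC equals the probability that a randomly chosen positive observation receives a higher predicted score than a randomly chosen negative one. The sketch below computes this directly. The one-vs-rest framing and the toy scores are assumptions, since the source does not state how the multiclass AUC is aggregated.

```python
def binary_auc(scores, is_positive):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy predicted probabilities of one class, for illustration only.
scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
is_positive = [True, True, True, False, False, False]

auc = binary_auc(scores, is_positive)
print(round(auc, 3))
```

A value of 0.5 would mean the scores rank positives no better than chance, while a value of 1 would mean every positive is ranked above every negative.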

Fig. 7 Random Forest Classifier model: Area under the curve across grid search

5 Discussion

Over the last decade, environmental, social, and governance (ESG) rating has become a key indicator for investment decisions by investors, including institutional investors, portfolio managers, rational investors, and firm managers, to assess sustainable development goals. We predict ESG ratings using firm and macroeconomic variables by applying six machine learning algorithms. We find that the RFC is the superior algorithm for predicting ESG ratings across the metrics considered (logLoss, AUC, prAUC, Kappa, F1, Sensitivity, Specificity, and the ROC curve). The RFC predicts ESG ratings with an accuracy of 78.5%.

The primary finding of this study is that the ESG score from the preceding year has the strongest influence on the prediction of ESG ratings. This indicates that ESG ratings are greatly influenced by how well a company performed in the previous year. This finding aligns with ideas from institutional, accountability, and legitimacy theories (Baldini et al., 2018; Weber, 2014). It also suggests that the increasing attention paid to ESG ratings is motivating company leaders to either maintain their previous year’s rating or work on improving it. In this way, they can uphold or even enhance the positive perception of their commitment to sustainability among investors and stakeholders.

Moreover, firm size is the second-highest contributing variable to ESG rating prediction. Consistent with the findings of Drempetic et al. (2020), Chauhan et al. (2014), and Hörisch et al. (2015), large firms with higher liquidity tend to invest more in ESG. Larger firms may be more aware of sustainability than smaller firms because they cause more damage to the environment. This viewpoint is consistent with stakeholder and signaling theories. According to these theories, larger corporations make an effort to maintain positive relationships with various stakeholders and to reduce their environmental and social footprint (Ansoff, 1965). Larger firms frequently use their growth to signal their potential for future success. They may therefore see investing in ESG practices as a way to demonstrate their commitment to sustainability, which can help their reputation and prospects (Spence, 1978).

Furthermore, we document that leverage is the third-highest contributing variable in ESG rating predictions. Our result is in line with the findings of Yu and Choi (2016), which suggest that leverage is a significant determinant of ESG scores. This finding can be interpreted through agency theory, which emphasizes the importance of ownership structure (Jensen & Meckling, 1976): when ownership is distributed among shareholders, there is a greater need for transparent information disclosure to align the interests of management and shareholders. In essence, highly leveraged firms appear to be motivated by a desire to demonstrate their commitment to sustainability and to maintain clear communication with various stakeholders, including investors. This intertwining of financial structure and ESG practices underscores the complex relationships between corporate decisions and external pressures.

6 Concluding remarks

6.1 Conclusion and policy implications

With rising stakeholder awareness of firms’ roles in society and an interest in social, environmental, and ethical issues, firm managers are under pressure from the public and private sectors to disclose their practices. ESG is also evolving as an investment decision tool for institutional investors and portfolio managers. This study develops ESG rating predictions using a machine learning approach. We use six machine learning classification models for ESG prediction with data on 6171 firms from 2005 to 2019. Of all models considered, the RFC model demonstrates superior performance across various metrics (logLoss, AUC, prAUC, Kappa, F1, Sensitivity, Specificity, and the ROC curve). The accuracy achieved by this model is 78.50%. Our results hold consistently under robustness testing with different train-test splits and hyperparameter tuning. This suggests that the RFC model is the most suitable choice to predict ESG ratings.

The study’s results have several significant implications. First, the significance of ESG ratings in investment decisions highlights the importance of consistent and reliable assessment methodologies. Given that our research identifies the RFC as the most effective algorithm for predicting ESG ratings, regulators and rating agencies may consider endorsing or adopting this algorithm to ensure consistent and accurate evaluations across industries. Moreover, integrating machine learning tools into enterprise resource planning systems may enable companies to systematically track their ESG ratings over time. Establishing a preferred algorithm could improve the transparency and comparability of ESG ratings, assisting investors in making informed decisions aligned with their sustainable development goals. Second, the significant impact of the previous year’s ESG score on future ratings emphasizes the importance of sustained commitment to ESG practices. Policymakers and industry groups could encourage companies to view ESG ratings as continuous improvement targets rather than static assessments. Recognizing the positive correlation between prior-year performance and future ratings, companies may be incentivized to focus on maintaining or improving their ESG scores, fostering a culture of sustainable practices, and enhancing their reputation among stakeholders.

Third, our findings highlight the relationship between firm size, liquidity, and ESG investing. Policymakers and industry regulators could encourage larger corporations to adopt ESG practices through targeted incentives and reporting requirements. Encouraging large corporations to invest proactively in ESG practices can have a positive impact on sustainability, given their significant influence on the environment and society. These corporations can demonstrate leadership by aligning growth strategies with sustainability goals, thereby contributing to a more sustainable business landscape. Fourth, the relationship between leverage and ESG scores reveals a way for businesses to demonstrate their commitment to sustainability. Regulatory bodies can promote greater transparency in financial disclosures related to leverage, encouraging companies to be accountable for their financial structures and ESG practices. Aligning financial and sustainability goals could encourage responsible behavior and strategic decision-making that benefits both investors and wider stakeholder groups. Finally, regulatory bodies and policymakers may consider the study’s conclusions when formulating policies related to ESG performance and disclosure, thereby cultivating a more sustainable and responsible corporate environment. By building on our ESG score prediction model, policymakers, practitioners, investors, regulatory bodies, portfolio managers, and other stakeholders may develop their own prediction models, which would help them obtain current information for policy formulation and informed investment choices.

6.2 Limitations and future research directions

This study focuses on ESG rating prediction based on Thomson Reuters ESG rating data, which cover few firms from emerging economies. Future studies can contribute by developing a new model with additional predictors that produces ESG ratings for all public-listed firms, including those not covered by Thomson Reuters. This would allow decision-makers to assess the ESG ratings of all public-listed firms. Future studies can also extend our findings by developing an ESG prediction model for small and medium enterprises.