Introduction

The rapid development of information and database technologies, coupled with notable progress in data analysis methods and computer hardware, has led to an exponential increase in the application of ML techniques in various areas, including business and finance (Ghoddusi et al. 2019; Gogas and Papadimitriou 2021; Chen et al. 2022; Hoang and Wiegratz 2022; Nazareth and Ramana 2023; Ozbayoglu et al. 2020; Xiao and Ke 2021). ML techniques applied to business and finance problems, such as marketing, e-commerce, and energy, have yielded highly promising results (Athey and Imbens 2019). Compared to traditional econometric models, ML techniques can more effectively handle large amounts of structured and unstructured data, enabling rapid decision-making and forecasting. These benefits stem from ML techniques’ ability to avoid making specific assumptions about the functional form, parameter distribution, or variable interactions and instead focus on making accurate predictions about the dependent variables based on other variables.

Exploring scientific databases, such as the Thomson Reuters Web of Science, reveals a significant exponential increase in the utilization of ML in business and finance. Figure 1 illustrates the outcomes of an inquiry into fundamental ML applications in emerging business and financial domains over the past few decades. Numerous studies in this field have applied ML techniques to resolve business and financial problems. Table 1 lists some of their applications. Boughanmi and Ansari (2021) developed a multimodal ML framework that integrates different types of non-parametric data to accommodate diverse effects. Additionally, they combined multimedia data in creative product settings and applied their model to predict the success of musical albums and playlists. Zhu et al. (2021) asserted that accurate demand forecasting is critical for supply chain efficiency, especially for the pharmaceutical supply chain, owing to its unique characteristics. However, a lack of sufficient data has prevented forecasters from pursuing advanced models. Accordingly, they proposed a demand forecasting framework that “borrows” time-series data from many other products and trains the data with advanced ML models. Yan and Ouyang (2018) proposed a time-series prediction model that combines wavelet analysis with a long short-term memory neural network to capture the complex features of financial time series and showed that the combined network achieved better predictive performance. Zhang et al. (2020a, b) employed a Bayesian learning model with a rich dataset to analyze the decision-making behavior of taxi drivers in a large Asian city to understand the key factors that drive the supply side of urban mobility markets.

Fig. 1

Trend of articles on applied ML techniques in business and finance (2007–2021)

Table 1 ML techniques applied in the business and finance domains

Several review papers have explored the potential of ML to enhance various domains, including agriculture (Raj et al. 2015; Coble et al. 2018; Kamilaris and Prenafeta-Boldu 2018; Storm et al. 2020), economic analysis (Einav and Levin 2014; Bajari et al. 2015; Grimmer 2015; Nguyen et al. 2020; Nosratabadi et al. 2020), and financial crisis prediction (Lin et al. 2012; Canhoto 2021; Dastile et al. 2020; Nanduri et al. 2020). Kou et al. (2019) conducted a survey encompassing research and methodologies related to the assessment and measurement of financial systemic risk that incorporated various ML techniques, including big data analysis, network analysis, and sentiment analysis. Meng and Khushi (2019) reviewed articles that focused on stock/forex prediction or trading, where reinforcement learning served as the primary ML method. Similarly, Nti et al. (2020) reviewed approximately 122 pertinent studies published in academic journals over an 11-year span, concentrating on the application of ML to stock market prediction.

Despite these valuable contributions, it is worth noting that the existing review papers primarily concentrate on specific issues within the realm of business and finance, such as the financial system and stock market. Consequently, although a substantial body of research exists in this area, a comprehensive and systematic review of the extensive applications of ML in various aspects of business and finance is lacking. In addition, existing review articles do not provide a comprehensive review of common ML techniques utilized in business and finance. To bridge the aforementioned gaps in the literature, we aim to provide an all-encompassing and methodological review of the extensive spectrum of ML applications in the business and finance domains. To begin with, we identify the most commonly utilized ML techniques in the business and finance domains. Then we introduce the fundamental ML concepts and frequently employed techniques and algorithms. Next, we systematically examine the extensive applications of ML in various sub-domains within business and finance, including marketing, stock markets, e-commerce, cryptocurrency, finance, accounting, credit risk management, and energy. We critically analyze the existing research that explores the implementation of ML techniques in business and finance to offer valuable insights to researchers, practitioners, and decision-makers, thereby facilitating better-informed decision-making and driving future research directions in this field.

The remainder of this paper is organized as follows. Section “Keywords, distribution of articles, and common technologies in the application of ML techniques in business and finance” outlines the literature retrieval process and presents the statistical findings from the literature analysis, including an analysis of common application trends and ML techniques. Section “Machine learning: a brief introduction” introduces fundamental concepts and terminology related to ML. Sections “Supervised learning” and “Unsupervised learning” explore in-depth common supervised and unsupervised learning techniques, respectively. Section “Applications of machine learning techniques in business and finance” discusses the most recent applications of ML in business and finance. Section “Critical discussions and future research directions” discusses some limitations of ML in this domain and analyzes future research opportunities. Finally, the “Conclusions” section concludes the paper.

Keywords, distribution of articles, and common technologies in the application of ML techniques in business and finance

The primary focus of this review is to explore the advancements in ML in business- and finance-related fields involving ML applications in various market-related issues, including prices, investments, and customer behaviors. This review employs the following strategies to identify existing literature. Initially, we identify relevant journals known for publishing papers that utilize ML techniques to address business and finance problems, such as the UTD-24. Table 2 lists the keywords used in the literature search. During the search process, we input various combinations of ML keywords and business/finance keywords, such as “support vector machine” and “marketing.” By cross-referencing the selected journals and keywords and thoroughly examining the citations of highly cited papers, we aimed to achieve a comprehensive and unbiased representation of the current literature.

Table 2 Keywords

After identifying journals and keywords, we searched for articles in the Thomson Reuters Web of Science and Elsevier Scopus databases using the same set of keywords. Once the collection phase was complete, the filtering process was initiated. Initially, duplicate articles were excluded to ensure that only unique articles remained for further analysis. Subsequently, we carefully reviewed the full text of each article to eliminate irrelevant or inappropriate items and thus ensure that the final selection comprised relevant and meaningful literature.

Figure 2 illustrates the process of article selection for the review. In the identification phase, we retrieved 154 articles from the search and identified an additional 37 articles through reference checking. During the second phase, duplicates and inappropriate articles were filtered out, resulting in a total of 68 articles eligible for inclusion in this study. Based on the review of these articles, we categorized them into seven different applications: stock market, marketing, e-commerce, energy marketing, cryptocurrency, accounting, and credit risk management, as depicted in Fig. 3 and Tables 3, 4, 5, 6, 7, 8 and 9. The statistical analysis reveals that ML research in the business and finance domain is predominantly concentrated in the areas of the stock market and marketing. Research on e-commerce, cryptocurrency, and energy market applications is nearly equivalent in quantity. Conversely, articles focusing on accounting and credit risk management applications are relatively limited. Figure 4 provides a summary of the ML techniques employed in the reviewed articles. Deep learning, support vector machine, and decision tree methods emerged as the most prominent research technologies. In contrast, the application of unsupervised learning techniques, such as k-means, and of reinforcement learning was less common.

Fig. 2

Flow diagram for article identification and filtering

Fig. 3

Number of papers employing ML techniques

Fig. 4

Prominent methods applied in the business and finance domains

Machine learning: a brief introduction

This section introduces the basic concepts of ML, including its goals and terminology. Thereafter, we present the model selection method and how to improve model performance.

Goals and terminology

The key objective in various scientific disciplines is to model the relationships between multiple explanatory variables and a set of dependent variables. When a theoretical mathematical model is established, researchers can use it to predict or control desired variables. However, in real-world scenarios, the underlying model is often too complex to be formulated as a closed-form input–output relationship. This complexity has led researchers in the field of ML to focus on developing algorithms (Wu et al. 2008; Chao et al. 2018). The primary goal of these algorithms is to predict certain variables based on other variables or to classify units using limited information; for example, they can be used to classify handwritten digits based on pixel values. ML techniques can automatically construct computational models that capture the intricate relationships present in available data by maximizing the problem-dependent performance criterion or minimizing the error term, which allows them to establish a robust representation of the underlying relationships.

In the context of ML, the sample used to estimate the parameters is usually referred to as a “training sample,” and the procedure for estimating the parameters is known as “training.” Let N be the sample size, k be the number of features, and q be the number of all possible outcomes. ML can be classified into two main types: supervised and unsupervised. In supervised learning problems, we know both the feature \({\mathbf{X}}_{i} = (x_{i1} ,...,x_{ik} ),\; \, i = 1,2,...,N\) and the outcome \(Y_{i} = (y_{i1} ,y_{i2} ,...,y_{iq} )\), where \(y_{ij}\) represents the outcome of \(y_{i}\) in the dimension \(j\). For example, in a recommendation system, the quality of a product can be scored from 1 to 5, indicating that \(q\) equals 5. In unsupervised learning problems, we only observe the features \({\mathbf{X}}_{i}\) (input data) and aim to group them into clusters based on their similarities or patterns.

Cross-validation, overfitting, and regularization

Cross-validation is frequently used for model selection in ML: the technique is applied to each candidate model, and the model with the lowest expected out-of-sample prediction error is selected.
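To make the procedure concrete, the following is a minimal sketch of k-fold cross-validation for choosing between two candidate models (assuming NumPy; the function names, the two toy models, and the synthetic data are purely illustrative):

```python
import numpy as np

def kfold_mse(model_fit_predict, X, y, k=5, seed=0):
    """Average out-of-sample mean squared error of a model over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for f in folds:
        train = np.setdiff1d(idx, f)           # all indices outside the fold
        y_hat = model_fit_predict(X[train], y[train], X[f])
        errors.append(np.mean((y[f] - y_hat) ** 2))
    return float(np.mean(errors))

# Two candidate "models": predict the training mean vs. ordinary least squares.
def mean_model(X_tr, y_tr, X_te):
    return np.full(len(X_te), y_tr.mean())

def ols_model(X_tr, y_tr, X_te):
    b, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

mse_mean = kfold_mse(mean_model, X, y)
mse_ols = kfold_mse(ols_model, X, y)
best = "ols" if mse_ols < mse_mean else "mean"  # select the lowest-error model
```

Because the synthetic outcome is linear in the features, the out-of-sample criterion selects the least-squares model here.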

The ML literature shows significantly higher concern about overfitting than the standard statistics or econometrics literature. In the ML community, the degrees of freedom are not explicitly considered, and many ML methods involve a large number of parameters, which can potentially lead to negative degrees of freedom.

Limiting overfitting is commonly achieved through regularization in ML, which controls the complexity of a model. As stated by Vapnik (2013), regularization theory was one of the first signs of intelligent inference. The complexity of a model describes its ability to approximate various functions. As complexity increases, the risk of overfitting also increases, whereas less complex and more regularized models may lead to underfitting. In traditional statistics, regularization is often implicit: researchers select a parsimonious number of variables and specific functional forms without explicitly controlling for overfitting. In ML, by contrast, a regularization term is added to the objective function, penalizing the complexity of the model. This approach encourages the model to generalize better and avoids overfitting by promoting simpler and more interpretable solutions.

Here, we provide an example to illustrate how regularization works. The following linear regression model was used:

$$y_{ij} = \sum\limits_{p = 1}^{k} {b_{pj} x_{ip} } + \sigma_{j}$$
(1)

where N is the sample size, k is the number of features, and q is the number of all possible outcomes. The variable \(y_{{ij}} (i = 1,2,...,N,\quad j = 1,2,...,q)\) represents the outcome of \(y_{i}\) in the jth dimension. Additionally, \(b_{pj} (p = 1,2,...,k,\;j = 1,2,...,q)\) represents the coefficient of feature p in the jth dimension. Using the vector notations \({{\varvec{\upsigma}}} = (\sigma_{1} ,...,\sigma_{q} )^{{ \top }}\), \({\mathbf{b}} = (b_{{11}} ,b_{{21}} ,...,b_{{k1}} ,b_{{12}} ,b_{{22}} ,...,b_{{k2}} ,...,b_{{1q}} ,b_{{2q}} ,...,b_{{kq}} )^{{ \top }}\), and \(Y_{i} = (y_{i1} ,y_{i2} ,...,y_{iq} )\), we can rewrite Eq. (1) as follows:

$$Y_{i} = {\mathbf{b}}^{{ \top }} X_{i} + {{\varvec{\upsigma}}}$$
(2)

where \({\mathbf{b}}\) is the solution of

$$\mathop {\min }\limits_{{\mathbf{b}}} \sum\limits_{i = 1}^{N} {(Y_{i} - {\mathbf{b}}^{{ \top }} {\mathbf{X}}_{i} )^{2} } + \lambda \left\| {\mathbf{b}} \right\|_{2}^{2}$$
(3)

\(\lambda\) is a penalty parameter that can be selected through cross-validation to optimize the model’s out-of-sample predictive performance.
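For illustration, the ridge objective in Eq. (3) has the well-known closed-form minimizer \(\hat{\mathbf{b}} = ({\mathbf{X}}^{\top }{\mathbf{X}} + \lambda I)^{-1} {\mathbf{X}}^{\top }Y\). The following sketch (assuming NumPy, with invented data) shows how the penalty shrinks the coefficients relative to least squares:

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form minimizer of sum_i (Y_i - b'X_i)^2 + lam * ||b||_2^2."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
b_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
Y = X @ b_true + 0.1 * rng.normal(size=100)

b_ols = ridge_fit(X, Y, lam=0.0)     # lam = 0 reduces to least squares
b_ridge = ridge_fit(X, Y, lam=50.0)  # the penalty shrinks b toward zero
```

The norm of the penalized estimate is strictly smaller than that of the unpenalized one, which is exactly the shrinkage behavior described above.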

Supervised learning

This section introduces common supervised learning techniques. Compared to traditional statistics, supervised learning methods exhibit certain desirable properties when optimizing predictions in large datasets, such as transaction and financial time-series data. In business and finance, supervised learning models have proven to be among the most effective tools for detecting credit card fraud (Lebichot et al. 2021). In the following subsections, we briefly describe the supervised ML methods commonly used in business and finance.

Shrinkage methods

The traditional least-squares method often yields complex models with an excessive number of explanatory variables. In particular, when the number of features, k, is large compared to the sample size N, the least-squares estimator, \({\hat{\mathbf{b}}}\), does not have good predictive properties, even if the conditional mean of the outcome is linear. To address this problem, regularization is typically used to adjust the estimation parameters dynamically and reduce the complexity of the model. The shrinkage method is the most common regularization method and can reduce the values of the parameters to be estimated. Shrinkage methods, such as ridge regression (Hoerl and Kennard 1970) and least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), are linear regression models that add a penalty term to the size of the coefficients. This penalty term pushes the coefficients towards zero, effectively shrinking their values. Shrinkage methods can be effectively used to predict continuous outcomes or classification tasks, particularly when dealing with datasets containing numerous explanatory variables.

Compared to the traditional approach that estimates the regression function using least squares,

$${\hat{\mathbf{b}}} = \mathop {\arg \min }\limits_{{\mathbf{b}}} \sum\limits_{i = 1}^{N} {(Y_{i} - {\mathbf{b}}^{{ \top }} {\mathbf{X}}_{i} )^{2} }$$
(4)

shrinkage methods add a penalty term that shrinks \({\mathbf{b}}\) toward zero, aiming to minimize the following objective function:

$$\mathop {\min }\limits_{{\mathbf{b}}} \sum\limits_{i = 1}^{N} {(Y_{i} - {\mathbf{b}}^{{ \top }} {\mathbf{X}}_{i} )^{2} } + \lambda \left\| {\mathbf{b}} \right\|_{q}^{q}$$
(5)

where \(\left\| {\mathbf{b}} \right\|_{q}^{q} = \sum\nolimits_{p = 1}^{k} {\left| {b_{p} } \right|^{q} }\). When \(q = 1\), this formulation yields the LASSO, whereas when \(q = 2\), it reduces to ridge regression.
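The \(q = 1\) (LASSO) case can be solved by cyclic coordinate descent with the standard soft-thresholding update; the sketch below (assuming NumPy; the data, the solver, and the choice of \(\lambda\) are ours, not from the reviewed papers) shows how irrelevant coefficients are driven to exactly zero:

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for
    min_b sum_i (y_i - b'x_i)^2 + lam * sum_p |b_p|  (the q = 1 case)."""
    N, k = X.shape
    b = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for p in range(k):
            r = y - X @ b + X[:, p] * b[p]  # residual with feature p removed
            b[p] = soft_threshold(X[:, p] @ r, lam / 2.0) / col_sq[p]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)
b_hat = lasso_cd(X, y, lam=20.0)  # zero coefficients are recovered exactly
```

This exact sparsity, produced by the kink of the absolute-value penalty at zero, is what distinguishes the LASSO from ridge regression, which only shrinks coefficients toward zero.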

Tree-based method

Regression trees (Breiman et al. 1984) and random forests (Breiman 2001) are effective methods for estimating regression functions with minimal tuning, especially when out-of-sample predictive ability is required. Considering a sample \((x_{i1} ,...,x_{ik} ,Y_{i} )\) for \(i = 1,2,...,N\), the idea of a regression tree is to split the sample into subsamples within which the regression function is estimated. The splitting process is sequential and based on whether a feature value \(x_{ij}\) exceeds a threshold \(c\). Let \(R_{1} (j,c)\) and \(R_{2} (j,c)\) be two sets based on the feature \(j\) and threshold \(c\), where \(R_{1} (j,c) = \left\{ {{\mathbf{X}}_{i} |x_{ij} \le c} \right\}\) and \(R_{2} (j,c) = \left\{ {{\mathbf{X}}_{i} |x_{ij} > c} \right\}\). Thus, the dataset \(R\) is divided into two parts, \(R_{1}\) and \(R_{2}\), based on the chosen feature and threshold.

Let \(c_{1} = \frac{1}{{|R_{1} |}}\sum\nolimits_{{{\mathbf{X}}_{i} \in R_{1} }} {Y_{i} }\) and \(c_{2} = \frac{1}{{|R_{2} |}}\sum\nolimits_{{{\mathbf{X}}_{i} \in R_{2} }} {Y_{i} }\) be the mean outcomes within each subsample, where \(| \bullet |\) refers to the cardinality of a set. We can then construct the following optimization model to calculate the errors on the \(R_{1}\) and \(R_{2}\) datasets:

$$\mathop {\min }\limits_{j,c} \{ \mathop {\min }\limits_{{c_{1} }} \sum\limits_{{{\mathbf{X}}_{i} \in R_{1} }} {(Y_{i} - c_{1} )^{2} + \mathop {\min }\limits_{{c_{2} }} \sum\limits_{{{\mathbf{X}}_{i} \in R_{2} }} {(Y_{i} - c_{2} )^{2} } } \}$$
(6)

For all features \(x_{ij}\) and thresholds \(c \in ( - \infty , + \infty )\), the method finds the optimal feature \(j^{*}\) and threshold \(c^{*}\) that minimize the errors and splits the sample into subsets based on these criteria. By selecting the best feature and threshold, the method obtains the optimal partition \(R_{1}^{*}\) and \(R_{2}^{*}\). This process is repeated recursively, leading to further splits that minimize the squared error and improve the overall model performance. However, researchers should be cautious about overfitting, wherein the model fits the training data too closely and fails to generalize well to new data. To address this issue, a penalty term can be added to the objective function to encourage simpler and more regularized models. The coefficients of the model are then selected through cross-validation, optimizing the penalty parameter to achieve the best trade-off between model complexity and predictive performance on new, unseen data. This helps prevent overfitting and ensures that the model's performance is robust and reliable.
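The split search of Eq. (6) can be sketched as an exhaustive scan over features and candidate thresholds (assuming NumPy; the toy sample is invented for illustration):

```python
import numpy as np

def best_split(X, y):
    """Find the (feature j, threshold c) pair minimizing the summed squared
    error of Eq. (6), with c_1 and c_2 the mean outcomes in the two parts."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:  # candidate thresholds per feature
            left, right = y[X[:, j] <= c], y[X[:, j] > c]
            sse = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if sse < best[2]:
                best = (j, c, sse)
    return best

# A toy sample in which feature 0 cleanly separates low and high outcomes.
X = np.array([[0.1, 1.0], [0.2, 1.2], [0.9, 5.0], [0.8, 5.1]])
y = np.array([1.0, 1.1, 9.0, 9.2])
j_star, c_star, sse = best_split(X, y)
```

Applying the same search recursively within each resulting subsample grows the full tree.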

Random forests build on the tree algorithm to better estimate the regression function. The approach smooths the regression function by averaging across multiple trees and differs from a single tree in two respects. First, instead of using the original sample, each tree is constructed from a bootstrap sample or a subsample of the data, a technique known as “bagging.” Second, at each stage of building a tree, the splits are not optimized over all possible features (covariates) but rather over a random subset of the features. Consequently, the feature selection varies across splits, which enhances the diversity of the individual trees.
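Both ingredients, bagging and random feature subsets, can be sketched with a forest of one-split trees (“stumps”); this is a deliberately minimal illustration assuming NumPy, with our own helper names and an illustrative choice of subset size:

```python
import numpy as np

def fit_stump(X, y, feat_idx):
    """One-split regression tree restricted to a subset of features."""
    best = None
    for j in feat_idx:
        for c in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= c
            sse = y[m].var() * m.sum() + y[~m].var() * (~m).sum()
            if best is None or sse < best[0]:
                best = (sse, j, c, y[m].mean(), y[~m].mean())
    _, j, c, left, right = best
    return lambda Z: np.where(Z[:, j] <= c, left, right)

def random_forest(X, y, n_trees=50, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random feature
    subset; predictions are averaged across trees."""
    rng = np.random.default_rng(seed)
    N, k = X.shape
    m = max(1, k // 2)  # features considered per split (illustrative choice)
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, N, N)                  # bootstrap sample
        feats = rng.choice(k, size=m, replace=False)  # random feature subset
        trees.append(fit_stump(X[boot], y[boot], feats))
    return lambda Z: np.mean([t(Z) for t in trees], axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 5.0 * (X[:, 0] > 0)           # only feature 0 matters
predict = random_forest(X, y)
pred = predict(X)
```

Averaging across the diverse stumps smooths the fitted function: trees whose feature subset misses feature 0 contribute little systematic signal and are averaged out.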

Deep learning and neural networks

Deep learning and neural networks have been proven to be highly effective in complex settings. However, it is worth noting that the practical implementation of deep learning often demands a considerable amount of tuning compared to other methods, such as decision trees or random forests.

Deep neural networks

As with other supervised learning methods, deep neural networks (DNNs) can be viewed as a straightforward mapping \(y=f(x;\theta )\) from the input feature vector \(x\) to the output vector or scalar \(y\), governed by the unknown parameters \(\theta\). This mapping typically consists of layers that form chain-like structures. Figure 5 illustrates the structure of a DNN. For a DNN with multiple layers, the structure can be represented as

$$y = f^{(k)} ( \cdots f^{(2)} (f^{(1)} (x;\theta_{1} );\theta_{2} ) \cdots ;\theta_{k} )$$
(7)
Fig. 5

Structure of DNN

In a fully connected DNN, the \(i\)th layer has the structure \(h^{(i)} = f^{(i)} (x) = g^{(i)} ({\mathbf{W}}^{(i)} h^{(i - 1)} + {\mathbf{b}}^{(i)} )\), where \({\mathbf{W}}^{(i)}\) is a matrix of unknown parameters and \({\mathbf{b}}^{(i)}\) is a vector of bias factors. A typical choice for \(g^{(i)}\), called the “activation function,” is a rectified linear unit, tanh transformation function, or sigmoid function. The 0th layer \(h^{(0)} = x\) represents the input vector. The row dimension of \({\mathbf{W}}^{(i)}\), which equals the dimension of \({\mathbf{b}}^{(i)}\), specifies the number of neurons in layer \(i\). The weight matrices \({\mathbf{W}}^{(i)}\) are learned by minimizing a loss function, which can be the mean squared error for regression tasks or the cross-entropy for classification tasks. In particular, when the DNN has one layer and \(y\) is a scalar, setting the activation function to linear or logistic yields linear or logistic regression, respectively.
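The chained layer structure above can be sketched as a forward pass (assuming NumPy; the layer sizes, random parameters, and ReLU choice are illustrative, and training by loss minimization is omitted):

```python
import numpy as np

def relu(z):
    """Rectified linear unit, one typical activation function g."""
    return np.maximum(z, 0.0)

def forward(x, params):
    """Fully connected forward pass: h^(i) = g(W^(i) h^(i-1) + b^(i)),
    with h^(0) = x and a linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W, b = params[-1]
    return W @ h + b

rng = np.random.default_rng(0)
sizes = [3, 8, 4, 1]  # input dim 3, two hidden layers, scalar output
params = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward(rng.normal(size=3), params)
```

In practice the parameters would be fitted by gradient-based minimization of the chosen loss rather than drawn at random.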

Convolutional neural networks

Although neural networks have many different architectures, the two most classical and relevant are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A classical CNN structure, which contains three main components (convolutional, pooling, and fully connected layers), is shown in Fig. 6. In contrast to the fully connected structure described above, each neuron in a convolutional layer connects with only a small fraction of the neurons in the previous layer, and the filter parameters are shared across positions. Such sparse connections and parameter sharing significantly reduce the number of parameters to be estimated.

Fig. 6

Structure of CNN

Different layers play different roles in the training process and are introduced in more detail as follows:

Convolutional layer: This layer comprises a collection of trained filters that are used to extract features from the input data. Assuming that \(X\) is the input and there are \(k\) filters, the output of the convolutional layer can be formulated as follows:

$$y_{j} = f(x*\omega _{j} + b_{j} ), \quad j = 1,2,...,k$$
(8)

where \(\omega_{j}\) and \(b_{j}\) denote the weights and bias, respectively; \(f\) represents the activation function; and \(*\) denotes the convolutional operator.
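A minimal one-dimensional instance of this layer can be sketched as follows (assuming NumPy; the input signal and the single rising-edge filter are invented, and, as in most deep learning libraries, the “convolution” is implemented as a sliding dot product without kernel flipping):

```python
import numpy as np

def conv1d_layer(x, filters, biases):
    """Valid 1-D convolutional layer: one feature map y_j = f(x * w_j + b_j)
    per filter, with a ReLU activation f."""
    maps = []
    for w, b in zip(filters, biases):
        m = len(x) - len(w) + 1
        z = np.array([x[t:t + len(w)] @ w for t in range(m)]) + b
        maps.append(np.maximum(z, 0.0))  # ReLU
    return np.stack(maps)

x = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])  # a step-shaped signal
edge = np.array([-1.0, 1.0])                       # rising-edge detector
maps = conv1d_layer(x, [edge], [0.0])
```

The single feature map fires only at the position where the signal rises, which illustrates how a learned filter responds to one specific local pattern.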

Pooling layer: This layer reduces the features and parameters of the network. The most popular pooling methods are the maximum and average pooling.

CNNs are designed to handle one-dimensional time-series data or images. Intuitively, each convolutional layer can be considered a set of filters that move across images or shift along time sequences. For example, some filters may learn to detect textures, whereas others may identify specific shapes. Each filter generates a feature map, and the subsequent convolutional layer integrates these features to create a more complex structure, resulting in a map of learned features. Suppose that \(S\) is a \(p \times p\) pooling window. Then the average pooling process can be formulated as

$$z = \frac{1}{N}\sum\limits_{(i,j) \in S} {x_{ij} } \quad i,j = 1,2,...,p$$
(9)

where \(x_{ij}\) is the activation value at location \((i,j)\), and \(N = p^{2}\) is the number of elements in the window \(S\).
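Non-overlapping average pooling, as in Eq. (9), can be sketched in a few lines (assuming NumPy; the 4 × 4 input is invented for illustration):

```python
import numpy as np

def average_pool(x, p):
    """Non-overlapping p x p average pooling: each output value is the mean
    of the p*p activations inside one window S."""
    H, W = x.shape
    return (x[:H - H % p, :W - W % p]
            .reshape(H // p, p, W // p, p)
            .mean(axis=(1, 3)))

x = np.arange(16.0).reshape(4, 4)
z = average_pool(x, 2)  # 4x4 feature map reduced to 2x2
```

Each of the four outputs is the mean of one 2 × 2 block, so the layer reduces the spatial resolution (and hence the parameter count downstream) by a factor of \(p^{2}\).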

Recurrent neural networks

Recurrent neural networks (RNNs) are well suited for processing sequential data, dynamic relations, and long-term dependencies. RNNs, particularly those employing long short-term memory (LSTM) cells, have become popular and have shown significant potential in natural language processing (Schmidhuber 2015). A key feature of this architecture is its ability to maintain past information over time using a cell-state vector. In each time step, new variables are combined with past information in the cell vector, enabling the RNN to learn how to encode information and determine which encoded information should be retained or forgotten. Similar to CNNs, RNNs benefit from parameter sharing, which allows them to detect specific patterns in sequential data.

Figure 7 illustrates the structure of the LSTM network, which contains a memory unit \(C_{t}\), a hidden state \(h_{t}\), and three types of gates. The index \(t\) refers to the time step. At each step \(t\), the LSTM combines the input \(x_{t}\) with the previous hidden state \(h_{t-1}\), calculates the activations of all gates, and updates the memory unit and hidden state accordingly.

Fig. 7

Structure of LSTM

The computations of LSTM networks are described as follows:

$$\begin{aligned} f_{t} & = \sigma (W_{f} x_{t} + \omega_{f} h_{t - 1} + b_{f} ) \\ i_{t} & = \sigma (W_{i} x_{t} + \omega_{i} h_{t - 1} + b_{i} ) \\ O_{t} & = \sigma (W_{o} x_{t} + \omega_{o} h_{t - 1} + b_{o} ) \\ C_{t} & = f_{t} \circ C_{t - 1} + i_{t} \circ \sigma_{c} (W_{c} x_{t} + \omega_{c} h_{t - 1} + b_{c} ) \\ h_{t} & = O_{t} \circ \tanh (C_{t} ) \\ \end{aligned}$$
(10)

where the \(W\) matrices denote the weights of the input \(x_{t}\), and the \(\omega\) matrices denote the weights of the previous hidden state \(h_{t - 1}\). The subscripts \(f,i,{\text{ and }}O\) refer to the forget, input, and output gate vectors, respectively; \(b\) indicates the biases; and \(\circ\) is an element-wise multiplication.
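A single LSTM time step implementing Eq. (10) can be sketched directly (assuming NumPy; the dimensions and random parameters are illustrative, and the parameter dictionary layout is our own convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step: p holds input weights W, recurrent weights w,
    and biases b for each of the gates f, i, o and the cell candidate c."""
    f = sigmoid(p["Wf"] @ x_t + p["wf"] @ h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["wi"] @ h_prev + p["bi"])  # input gate
    o = sigmoid(p["Wo"] @ x_t + p["wo"] @ h_prev + p["bo"])  # output gate
    C = f * C_prev + i * np.tanh(p["Wc"] @ x_t + p["wc"] @ h_prev + p["bc"])
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
d, n = 3, 4  # input and hidden dimensions
p = {k + g: rng.normal(scale=0.1,
                       size=(n, d) if k == "W" else (n, n) if k == "w" else n)
     for g in "fioc" for k in ("W", "w", "b")}
h, C = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), p)
```

Iterating `lstm_step` over a sequence carries the cell state \(C_t\) forward, which is the mechanism that retains or forgets past information.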

Wavelet neural networks

Wavelet neural networks (Zhang and Benveniste 1992) use the wavelet function as the activation function, thus combining the advantages of both the wavelet transform and neural networks. The structure of wavelet neural networks is based on backpropagation neural networks, and the transfer function of the hidden layer neuron is the mother wavelet function. For input features \({\mathbf{x}} = (x_{1} ,...,x_{n} )\), the output of the hidden layer can be expressed as follows:

$$h(j) = h_{j} \left[ {\frac{{\sum\nolimits_{i = 1}^{n} {\omega_{ij} } x_{i} - b_{j} }}{{a_{j} }}} \right],\quad j = 1,2,...,m$$
(11)

where \(h(j)\) is the output value for neuron \(j\), \(h_{j}\) is the mother wavelet function, \(\omega_{ij}\) is the weight between the input and hidden layers, \(b_{j}\) is the shift factor, and \(a_{j}\) is the stretch factor for \(h_{j}\).
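The hidden-layer computation of Eq. (11) can be sketched as follows (assuming NumPy; the Mexican-hat mother wavelet is one common choice, not the one prescribed here, and the weights, shifts, and stretches are random placeholders):

```python
import numpy as np

def mexican_hat(t):
    """One common mother wavelet h_j (an illustrative choice)."""
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def wavelet_hidden_layer(x, W, b, a):
    """Eq. (11): the weighted input is shifted by b_j, stretched by a_j,
    and passed through the mother wavelet."""
    return mexican_hat((W @ x - b) / a)

rng = np.random.default_rng(0)
n, m = 3, 5                      # input features, hidden neurons
W = rng.normal(size=(m, n))      # weights omega_ij
b = rng.normal(size=m)           # shift factors b_j
a = np.ones(m) + rng.random(m)   # stretch factors a_j (kept positive)
h = wavelet_hidden_layer(rng.normal(size=n), W, b, a)
```

In a full wavelet network, \(W\), \(b\), and \(a\) would be trained by backpropagation, exactly as in the underlying backpropagation network.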

Support vector machine and kernels

Support vector machines (SVMs) are flexible classification methods (Cortes and Vapnik 1995). Let us consider a binary classification problem with \(N\) observations \({\mathbf{X}}_{i}\), each with \(k\) features and a binary label \(y_{i} \in \{ - 1,1\}\). A hyperplane \(\{ x \in {\mathbb{R}}^{k} \;{\text{s}}.{\text{t}}.\;w^{{ \top }} x + b = 0\}\) defines the binary classifier \({\text{sgn}} (w^{{ \top }} {\mathbf{X}}_{i} + b)\). The goal of the SVM is to find a hyperplane that separates the observations into the two classes, +1 and −1. Among all separating hyperplanes, the SVM selects the one that maximizes the distance to the closest samples. Typically, a small set of samples attains this minimal distance to the hyperplane; these samples are referred to as “support vectors.”

The above-mentioned process can be written as the following optimization model:

$$\mathop {\min }\limits_{\omega ,b} \, \frac{1}{2}\left\| \omega \right\|^{2}$$
(12)
$$s.t. \, Y_{i} (\omega^{{ \top }} {\mathbf{X}}_{i} + b) \ge 1, \quad {\text{ for all }}i = 1,...,N$$
(13)

To solve the above optimization model, we rewrite it in terms of Lagrangian multipliers as follows:

$$\begin{array}{*{20}l} {\mathop {\min }\limits_{\alpha ,\omega ,b} \left\{ {\frac{1}{2}\left\| \omega \right\|^{2} - \sum\limits_{i = 1}^{N} {\alpha_{i} (Y_{i} (\omega^{{ \top }} {\mathbf{X}}_{i} + b) - 1)} } \right\}} \hfill \\ {s.t.\quad \quad \alpha_{i} \ge 0} \hfill \\ \end{array}$$
(14)

where \(\alpha_{i}\) is the Lagrangian multiplier of the original restriction \(Y_{i} (\omega^{{ \top }} {\mathbf{X}}_{i} + b) \ge 1\). The model above is equivalent to

$$\begin{array}{*{20}l} {\mathop {\max }\limits_{{{\varvec{\upalpha}}}} \left\{ {\sum\limits_{i = 1}^{N} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\alpha_{i} \alpha_{j} Y_{i} Y_{j} {\mathbf{X}}_{i}^{{ \top }} {\mathbf{X}}_{j} } } } \right\}} \hfill \\ {s.t.\quad \quad \left\{ \begin{gathered} \alpha_{i} \ge 0 \hfill \\ \sum\limits_{i = 1}^{N} {\alpha_{i} Y_{i} } = 0 \hfill \\ \end{gathered} \right.} \hfill \\ \end{array}$$
(15)

We can obtain the Lagrangian multiplier \({{\varvec{\upalpha}}} = (\alpha_{1} ,...,\alpha_{N} )\) from Model (15), and then \(\widehat{b}\) can be solved from \(\sum\nolimits_{i = 1}^{N} {\hat{\alpha }_{i} (Y_{i} (\omega^{{ \top }} {\mathbf{X}}_{i} + b) - 1)} = 0\). Furthermore, we can obtain the classifier:

$$f(x) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{N} {Y_{i} \hat{\alpha }_{i} {\mathbf{X}}_{i}^{{ \top }} x} + \widehat{b}} \right)$$
(16)

Traditional SVM assumes linearly separable training samples. However, SVM can also deal with non-linear cases by mapping the original covariates to a new feature space using the function \(\phi ({\mathbf{X}}_{i} )\) and then finding the optimal hyperplane in this transformed feature space; that is, \(f(x_{i} ) = \omega^{{ \top }} \phi (x_{i} ) + b\). Thus, the optimization problem in the transformed feature space can be formulated as

$$\begin{array}{*{20}l} {\mathop {\max }\limits_{\alpha } \left\{ {\sum\limits_{i = 1}^{N} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\alpha_{i} \alpha_{j} Y_{i} Y_{j} K({\mathbf{X}}_{i} ,{\mathbf{X}}_{j} )} } } \right\}} \hfill \\ {s.t.\quad \quad \left\{ \begin{gathered} C \ge \alpha_{i} \ge 0 \hfill \\ \sum\limits_{i = 1}^{N} {\alpha_{i} Y_{i} } = 0 \hfill \\ \end{gathered} \right.} \hfill \\ \end{array}$$
(17)

where \(K({\mathbf{X}}_{i} ,{\mathbf{X}}_{j} ) = \phi ({\mathbf{X}}_{i} )^{{ \top }} \phi ({\mathbf{X}}_{j} )\). The kernel function \(K( \bullet )\) can be linear, polynomial, or sigmoid. Once the kernel function is determined, we can solve for the value of the Lagrangian multiplier \(\alpha\). Then \(\widehat{b}\) can be solved from \(\sum\nolimits_{i = 1}^{N} {\hat{\alpha }_{i} (Y_{i} (\omega^{{ \top }} {\mathbf{X}}_{i} + b) - 1)} = 0\), which allows us to derive the classifier:

$$f(x) = {\text{sgn}} \left( {\sum\limits_{i = 1}^{N} {Y_{i} \hat{\alpha }_{i} K({\mathbf{X}}_{i} ,x)} + \widehat{b}} \right)$$
(18)
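In practice, the dual of Model (17) is usually solved with a dedicated quadratic programming routine (e.g., SMO). As a rough self-contained illustration (assuming NumPy; we use a simple projected gradient ascent rather than a production solver, an RBF kernel, and invented, well-separated data), the kernelized dual and the classifier of Eq. (18) can be sketched as:

```python
import numpy as np

def rbf_kernel(Xa, Xb, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_dual_fit(X, Y, C=10.0, lr=1e-3, n_iter=2000, gamma=0.5):
    """Projected gradient ascent on the kernelized dual: ascend the
    objective, project onto sum_i alpha_i Y_i = 0, clip to [0, C]."""
    K = rbf_kernel(X, X, gamma)
    Q = (Y[:, None] * Y[None, :]) * K
    alpha = np.zeros(len(Y))
    for _ in range(n_iter):
        alpha += lr * (1.0 - Q @ alpha)   # gradient of the dual objective
        alpha -= Y * (alpha @ Y) / len(Y)  # equality-constraint projection
        alpha = np.clip(alpha, 0.0, C)     # box constraints
    sv = alpha > 1e-6                      # support vectors
    b = np.mean(Y[sv] - (alpha * Y) @ K[:, sv])
    return alpha, b

def svm_predict(X_tr, Y, alpha, b, X_new, gamma=0.5):
    """Classifier of Eq. (18)."""
    return np.sign((alpha * Y) @ rbf_kernel(X_tr, X_new, gamma) + b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (10, 2)), rng.normal(3.0, 0.5, (10, 2))])
Y = np.r_[-np.ones(10), np.ones(10)]
alpha, b = svm_dual_fit(X, Y)
pred = svm_predict(X, Y, alpha, b, X)
```

Note that the kernel matrix is all that the solver touches: the feature map \(\phi\) never needs to be computed explicitly, which is precisely the kernel trick.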

Bayesian classifier

A Bayesian network is a graphical model that represents the probabilistic relationships among a set of features (Friedman et al. 1997). The Bayesian network structure is a directed acyclic graph. Formally, a Bayesian network is a pair \(B = \left\langle {G,\Theta } \right\rangle\), where \(G\) is a directed acyclic graph whose nodes represent the random variables \(\left( {X_{1} ,...,X_{n} } \right)\) and whose edges represent the dependencies between the variables, and \(\Theta\) is the set of parameters that quantify the graph.

Assume that there are \(q\) labels; that is, \({\mathbf{Y}} = \{ c_{1} ,...,c_{q} \}\). Let \(\lambda_{ij}\) be the loss caused by misclassifying a sample with the true label \(c_{j}\) as \(c_{i}\), and let \({\mathbb{X}}\) represent the sample space. Then, based on the posterior probability \(P(c_{i} |{\mathbf{x}})\), we can calculate the expected loss of classifying sample \({\mathbf{x}}\) into label \(c_{i}\) as follows:

$$R(c_{i} |{\mathbf{x}}) = \sum\limits_{j = 1}^{q} {\lambda_{ij} P(c_{j} |{\mathbf{x}})}$$
(19)

Therefore, the aim of the Bayesian classifier is to find a decision rule \(h:{\mathbb{X}} \to {\mathbf{Y}}\) that minimizes the total risk

$$R(h) = E_{{\mathbf{x}}} [R(h({\mathbf{x}})|{\mathbf{x}})]$$
(20)

Obviously, for each sample \({\mathbf{x}}\), if \(h\) minimizes the conditional risk \(R(h({\mathbf{x}})|{\mathbf{x}})\), the total risk \(R(h)\) is also minimized. This leads to the Bayes decision rule: to minimize the total risk, we classify each sample into the label that minimizes the conditional risk \(R(h({\mathbf{x}})|{\mathbf{x}})\), namely

$$h^{*} ({\mathbf{x}}) = \mathop {\arg \min }\limits_{{c \in {\mathbf{Y}}}} R(c|{\mathbf{x}})$$
(21)

We refer to \(h^{*}\) as the Bayes-optimal classifier and to \(R(h^{*} )\) as the Bayes risk.
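
As a small numeric sketch of the Bayes decision rule in (21), suppose the posteriors \(P(c_j|\mathbf{x})\) and the loss matrix \(\lambda_{ij}\) are given; the rule picks the label with minimal conditional risk. All numbers below are illustrative assumptions.

```python
# Numeric sketch of Eq. (19) and Eq. (21): compute the conditional risk
# R(c_i|x) = sum_j lambda_ij * P(c_j|x) and pick the minimizing label.
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])   # P(c_j | x) for q = 3 labels
loss = np.array([[0, 1, 1],             # lambda_ij: 0 on the diagonal,
                 [1, 0, 1],             # unit loss otherwise (0/1 loss)
                 [1, 1, 0]])
risk = loss @ posterior                 # R(c_i | x) for each candidate label
h_star = int(np.argmin(risk))           # Bayes-optimal label index, Eq. (21)
print(h_star)                           # → 1 (the most probable label)
```

Under 0/1 loss the conditional risk is \(1 - P(c|\mathbf{x})\), so minimizing the risk is equivalent to picking the label with the largest posterior.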

K-nearest neighbor

The K-nearest neighbor (KNN) algorithm is a lazy-learning algorithm because it defers the induction process until classification is required (Wettschereck et al. 1997). Lazy-learning algorithms require less computation time during the training process than eager-learning algorithms such as decision trees, neural networks, and Bayesian networks. However, they may require additional time during the classification phase.

The kNN algorithm is based on the assumption that instances close to each other in a feature space are likely to have similar properties. If instances with the same classification label are found nearby, an unlabeled instance can be assigned the same class label as its nearest neighbors. kNN locates the k-nearest instances to the unlabeled instance and determines its label by observing the most frequent class label among these neighbors.

The choice of k significantly affects the performance of the kNN algorithm. Let us discuss the performance of kNN when \(k = 1\). Given sample \({\mathbf{x}}\) and its nearest sample \({\mathbf{z}}\), the probability of error can be expressed as follows:

$$P(err) = 1 - \sum\limits_{{c \in {\mathbf{Y}}}} {P(c|{\mathbf{x}})} P(c|{\mathbf{z}})$$
(22)

Suppose the samples are independent and identically distributed. For any \({\mathbf{x}}\) and any positive number \(\delta\), there always exists at least one sample \({\mathbf{z}}\) within a distance of \(\delta\) from \({\mathbf{x}}\). Let \(c^{*} ({\mathbf{x}}) = \mathop {\arg \max }\limits_{{c \in {\mathbf{Y}}}} P(c|{\mathbf{x}})\) denote the outcome of the Bayes-optimal classifier. Then we have:

$$\begin{aligned} P(err) & = 1 - \sum\limits_{{c \in {\mathbf{Y}}}} {P(c|{\mathbf{x}})P(c|{\mathbf{z}})} \\ & \approx 1 - \sum\limits_{{c \in {\mathbf{Y}}}} {P^{2} (c|{\mathbf{x}})} \\ & \le 1 - P^{2} (c^{*} |{\mathbf{x}}) \\ & \le 2(1 - P(c^{*} |{\mathbf{x}})) \\ \end{aligned}$$
(23)

According to (23), despite its simplicity, kNN has a generalization error no more than twice that of the Bayes-optimal classifier.
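
The distance-and-vote procedure described above can be sketched in a few lines; the training points and the query point below are illustrative assumptions.

```python
# Minimal kNN sketch (Euclidean distance, majority vote); the data
# points are made up for illustration.
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)          # distances to all samples
    nearest = y_train[np.argsort(d)[:k]]             # labels of the k nearest
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]                   # most frequent label

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # → 1
```
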

Unsupervised learning

In unsupervised learning, researchers can only access observations without any labeled information, and their primary interest lies in partitioning a sample into subsamples or clusters. Unsupervised learning methods are particularly useful in descriptive tasks because they aim to find relationships in a data structure without measuring the outcomes. Several approaches commonly used in business and finance research fall under the umbrella of unsupervised learning, including k-means clustering and reinforcement learning. Accordingly, unsupervised learning can be used in qualitative business and finance research. For example, it can be particularly beneficial during stakeholder analysis, when stakeholders must be mapped and classified by considering certain predefined attributes. It can also be useful for customer management. A company can employ an unsupervised ML method to cluster guests, which influences its marketing strategy for specific groups and leads to a competitive advantage. This section introduces unsupervised learning technologies that are widely used in business and finance.

K-means clustering

The K-means algorithm aims to find K points in the sample space and assign each sample to the nearest of these points. Using an iterative method, the values of each cluster center are updated step-by-step to achieve the best clustering results. When partitioning the feature space into K clusters, the k-means algorithm selects centroids \(b_{1} ,...,b_{k}\) and assigns observations to clusters based on their proximity to these centroids. The algorithm proceeds as follows. First, we begin with the K centroids \(b_{1} ,...,b_{k}\), which are initially scattered throughout the feature space. Next, in accordance with the chosen centroids, each observation is assigned to the cluster that minimizes the distance between the observation and the cluster's centroid:

$$C_{i} = \mathop {\arg \min }\limits_{{c \in \{ 1,2,...,k\} }} \left\| {{\mathbf{X}}_{i} - b_{c} } \right\|^{2}$$
(24)

Next, we update the centroid by computing the average of \(X_{i}\) across each cluster:

$$b_{c} = \frac{\sum\nolimits_{{i:C_{i} = c}} {{\mathbf{X}}_{i} } }{\sum\nolimits_{i = 1}^{N} {I(C_{i} = c)} }$$
(25)

where \(I( \bullet )\) is the indicator function. When choosing the number of clusters, K, we must exercise caution because no cross-validation method is available to compare different values of K.
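
The two alternating steps, assignment in (24) and centroid update in (25), can be sketched directly in NumPy. The two well-separated synthetic clusters and the initialization below are illustrative assumptions.

```python
# Sketch of the two k-means steps, Eqs. (24)-(25); synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # cluster near (0, 0)
               rng.normal(3.0, 0.1, (20, 2))])   # cluster near (3, 3)
b = X[[0, 39]].copy()                            # one initial centroid per cluster

for _ in range(10):
    # Eq. (24): assign each observation to its nearest centroid
    C = np.argmin(((X[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1), axis=1)
    # Eq. (25): recompute each centroid as the mean of its cluster
    b = np.array([X[C == c].mean(axis=0) for c in range(2)])

print(np.round(b, 1))   # centroids close to the two generating means
```
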

Reinforcement learning

Reinforcement learning (RL) draws inspiration from the trial-and-error procedure conducted by Thorndike in his 1898 study of cat behavior. Originating from animal learning, RL aims to mimic human behavior by making decisions that maximize profits through interactions with the environment. Mnih et al. (2015) proposed deep RL by employing a deep Q-network to create an agent that outperformed a professional player in a game and further advanced the field of RL.

In deep RL, the learning algorithm plays an essential role in improving efficiency. These algorithms can be categorized into three types: value-based, policy-based, and model-based RL, as illustrated in Fig. 8.

Fig. 8
figure 8

Learning algorithm-based reinforcement learning

RL consists of four components—agent, state, action, and reward—with the agent at its core. When an action leads to a profitable state, the agent receives a reward; otherwise, it is penalized. In RL, an agent is defined as any decision-maker, while everything else is considered the environment. The interactions between the environment and the agent are described by state \(s\), action \(a\), and reward \(r\). At time step \(t\), the environment is in state \(s_{t}\), and the agent takes action \(a_{t}\). Consequently, the environment transitions to state \(s_{t + 1}\) and rewards the agent with \(r_{t + 1}\).

The agent’s decision is formalized by a policy \(\pi\), which maps state \(s\) to action \(a\). The policy is deterministic when the probability of choosing action \(a\) in state \(s\) equals one (i.e., \(\pi (a|s) = p(a|s) = 1\)) and stochastic when \(p(a|s) < 1\). Policy \(\pi\) can be defined as the probability distribution over all actions selected from a certain state \(s\), as follows:

$$\begin{aligned} \pi & = \Psi (s) \\ & = \left\{ {p(a_{i} |s)\left| {\forall a_{i} \in \Delta_{\pi } \wedge \sum\limits_{i} {p(a_{i} |s) = 1} } \right.} \right\} \\ \end{aligned}$$
(26)

where \(\Delta_{\pi }\) represents all possible actions of \(\pi\).

In each step, the agent receives an immediate reward \(r_{t + 1}\) until it reaches the final state \(s_{T}\). However, the immediate reward does not ensure a long-term profit. To address this, a generalized return value is used at time step \(t\), defined as \(R_{t}\):

$$R_{t} = r_{t + 1} + \gamma r_{t + 2} + \gamma^{2} r_{t + 3} + ... + \gamma^{T - t - 1} r_{T}$$
(27)

where \(0 \le \gamma \le 1\). The agents become more farsighted when \(\gamma\) approaches 1, and more shortsighted when it approaches 0.
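
As a quick arithmetic check of (27), the return is just a discounted sum of the remaining rewards; the reward sequence and discount factor below are illustrative assumptions.

```python
# Sketch of the discounted return R_t in Eq. (27); the reward sequence
# and gamma are made-up illustrative values.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]                 # r_{t+1}, r_{t+2}, ..., r_T
R_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(round(R_t, 3))                           # → 3.349  (= 1 + 0 + 0.81*2 + 0.729*1)
```
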

The next step is to define a score function \(V\) to estimate the goodness of the state:

$$V_{\pi } (s) = E[R_{t} |s_{t} = s,\pi ]$$
(28)

Then, we determine the goodness of a state-action pair \((s,a)\):

$$Q_{\pi } (s,a) = E[R_{t} |s_{t} = s,a_{t} = a,\pi ]$$
(29)

Next, we assess the relative goodness of two policies:

$$\pi \le \pi^{\prime} \Leftrightarrow [V_{\pi } (s) \le V_{{\pi^{\prime}}} (s),\forall s] \vee [Q_{\pi } (s,a) \le Q_{{\pi^{\prime}}} (s,a),\forall (s,a)]$$
(30)

Finally, we can expand \(V_{\pi } (s)\) and \(Q_{\pi } (s,a)\) through \(R_{t}\) to represent the relationship between states \(s\) and \(s^{\prime}\) as

$$V_{\pi } (s) = \sum\limits_{a} {\pi (s,a)\sum\limits_{{s^{\prime}}} {p(s^{\prime}|s,a)(W_{{s \to s^{\prime}|a}} + \gamma V_{\pi } (s^{\prime}))} }$$
(31)

and

$$Q_{\pi } (s,a) = \sum\limits_{{s^{\prime}}} {p(s^{\prime}|s,a)(W_{{s \to s^{\prime}|a}} + \gamma \sum\limits_{{a^{\prime}}} {\pi (s^{\prime},a^{\prime})Q_{\pi } (s^{\prime},a^{\prime})} )}$$
(32)

where \(W_{{s \to s^{\prime}|a}} = E[r_{t + 1} |s_{t} = s,a_{t} = a,s_{t + 1} = s^{\prime}]\). By solving (31) and (32), we obtain \(V\) and \(Q\), respectively.
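
Equation (31) can be solved by fixed-point iteration, since the right-hand side is a contraction for \(\gamma < 1\). The following toy sketch evaluates \(V_\pi\) for a hypothetical two-state, two-action MDP; all transition probabilities, rewards, and the uniform policy are made-up illustrative values.

```python
# Toy policy evaluation via fixed-point iteration of Eq. (31);
# the MDP (P, W) and policy pi are illustrative assumptions.
import numpy as np

n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))            # p(s'|s,a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
W = np.array([[[0.0, 1.0], [0.0, 2.0]],  # W_{s->s'|a}: expected reward
              [[1.0, 0.0], [2.0, 0.0]]])
pi = np.full((n_s, n_a), 0.5)            # uniform stochastic policy pi(s,a)
gamma = 0.9

V = np.zeros(n_s)
for _ in range(500):                     # iterate the Bellman expectation, Eq. (31)
    V = np.array([sum(pi[s, a] * P[s, a] @ (W[s, a] + gamma * V)
                      for a in range(n_a)) for s in range(n_s)])
print(np.round(V, 2))                    # converged state values
```

An analogous iteration over state-action pairs solves (32) for \(Q_\pi\).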

Restricted Boltzmann machines

As Fig. 9 shows, a restricted Boltzmann machine (RBM) can be considered an undirected neural network with two layers, called the “hidden” and “visible” layers. The hidden layer detects features, whereas the visible layer receives the training input. Given a visible layer \(v\) with \(n\) units and a hidden layer \(h\) with \(m\) units, the energy function is given by

$$E(v,h) = - \sum\limits_{i = 1}^{n} {a_{i} v_{i} } - \sum\limits_{j = 1}^{m} {b_{j} h_{j} } - \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{m} {\alpha_{ij} v_{i} h_{j} } }$$
(33)

where \(\alpha_{ij}\) is the weight between visible unit \(i\) and hidden unit \(j\), and \(a_{i}\) and \(b_{j}\) are the biases for \(v\) and \(h\), respectively.
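
Evaluating (33) for a concrete configuration is a matter of two dot products and a bilinear form; the binary states, weights, and biases below are arbitrary illustrative values.

```python
# Illustrative computation of the RBM energy in Eq. (33);
# all states, weights, and biases are made-up example values.
import numpy as np

v = np.array([1, 0, 1])              # n = 3 visible units
h = np.array([1, 1])                 # m = 2 hidden units
a = np.array([0.1, -0.2, 0.3])       # visible biases a_i
b = np.array([0.05, -0.1])           # hidden biases b_j
W = np.array([[0.5, -0.3],           # weights alpha_ij
              [0.2, 0.4],
              [-0.1, 0.6]])

E = -a @ v - b @ h - v @ W @ h       # Eq. (33)
print(round(float(E), 3))            # → -1.05
```
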

Fig. 9
figure 9

Structure of RBM

Applications of machine learning techniques in business and finance

This section considers the application fields in the following categories: marketing, stock market, e-commerce, cryptocurrency, finance, accounting, credit risk management, and energy economy. This study reviews the application status of ML in these fields.

Marketing

ML is an innovative technology that can potentially improve forecasting models and assist in management decision-making. ML applications can be highly beneficial in the marketing domain because they rely heavily on building accurate predictive models from databases. Compared to the traditional statistical approach for forecasting consumer behavior, researchers have recently applied ML technology, which offers several distinctive advantages for data mining with large, noisy databases (Sirignano and Cont 2019). An early example of ML in marketing can be found in the work of Zahavi and Levin (1997), who used neural networks (NNs) to model consumer responses to direct marketing. Compared with the statistical approach, simple forms of NNs are free from the assumptions of normality or complete data, making them particularly robust in handling noisy data. Recently, as shown in Table 3, ML techniques have been predominantly used to study customer behaviors and demands. These applications enable marketers to gain valuable insights and make data-driven decisions to optimize marketing strategies.

Table 3 Machine learning techniques in marketing

Consumer behavior refers to the actions taken by consumers to request, use, and dispose of consumer goods, as well as the decision-making process that precedes and determines these actions. In the context of direct marketing, Cui et al. (2006) proposed Bayesian networks that learn by evolutionary programming to model consumer responses to direct marketing using a large direct marketing dataset. In the supply chain domain, Melancon et al. (2021) used gradient-boosted decision trees to predict service-level failures in advance and provide timely alerts to planners for proactive actions. Regarding unsupervised learning in consumer behavior analysis, Dingli et al. (2017) implemented a CNN and an RBM to predict customer churn. However, they found that their performance was comparable to that of supervised learning when introducing added complexity in specific operations and settings. Overall, ML techniques have demonstrated their potential for understanding and predicting consumer behavior, thereby enabling businesses to make informed decisions and optimize their marketing strategies (Machado and Karray 2022; Mao and Chao 2021).

Predicting consumer demand plays a critical role in helping enterprises efficiently arrange production and generate profits. Timoshenko and Hauser (2019) used a CNN to facilitate qualitative analysis by selecting the content for an efficient review. Zhang et al. (2020a, b) used a Bayesian learning model with a rich dataset to analyze the decision-making behavior of taxi drivers in a large Asian city to understand the key factors that drive the supply side of urban mobility markets. Ferreira et al. (2016) employed ML techniques to estimate historical lost sales and predict future demand for new products. For the application of consumer demand-level prediction, most of the research we reviewed used supervised learning technologies because learning consumer consumption preferences requires historical data of consumers, and only clustering consumers is insufficient to predict their consumption levels.

Stock market

ML applications in the stock market have gained immense popularity, with the majority focusing on financial time series for stock price predictions. Table 4 summarizes the reviewed articles that employed ML methods in stock market studies, including references, research objectives, data sources, applied techniques, and journals. Investing in the stock market can be highly profitable but also entails risk. Therefore, investors always try to determine and estimate stock values before taking any action. Researchers have mostly used ML techniques to predict stock prices (Bennett et al. 2022; Moon and Kim 2019). However, predicting stock values can be challenging due to the influence of uncontrollable economic and political factors that make it difficult to identify future market trends. Additionally, financial time-series data are often noisy and non-stationary, rendering traditional forecasting methods less reliable for stock value predictions. Researchers have explored ML in sentiment analysis to identify future trends in the stock market (Baba and Sevil 2021). Furthermore, other studies have focused on objectives such as algorithmic trading, portfolio management, and S&P 500 index trend prediction using ML techniques (Cuomo et al. 2022; Go and Hong 2019).

Table 4 Machine learning techniques in the stock market

Various ML techniques have been successfully applied for stock price predictions. Fischer and Krauss (2018) applied LSTM networks to predict the out-of-sample directional movements of the constituent stocks of the S&P 500 from 1992 to 2015, demonstrating that LSTM networks outperform memory-free classification methods. Wu et al. (2021) applied LASSO, random forest, gradient boosting, and a DNN to cross-sectional return predictions in hedge fund selection and found that ML techniques significantly outperformed four styles of hedge fund research indices in almost all situations. Bao et al. (2017) fed high-level denoising features into the LSTM to forecast the next day’s closing price. Sabeena and Venkata (2019) proposed a modified adversarial-network-based framework that integrated a gated recurrent unit and a CNN to acquire data from online financial sites and processed the obtained information using an adversarial network to generate predictions. Song et al. (2019) used deep learning methods to predict future stock prices. Sohangir et al. (2018) applied several NN models to stock market opinions posted on StockTwits to determine whether deep learning models could be adapted to improve the performance of sentiment analysis on StockTwits. Bianchi et al. (2021) showed that extreme trees and NNs provide strong statistical evidence in favor of bond return predictability. Vo et al. (2019) proposed a deep responsible investment portfolio model containing an LSTM network to predict stock returns. All of these stock price applications rely on supervised learning techniques, with financial time-series data providing the supervision. In contrast, it is challenging to apply unsupervised learning methods, particularly clustering, in this domain (Chullamonthon and Tangamchit 2023). However, RL still has certain applications in the stock markets.
Lei (2020) combined deep learning and RL models to develop a time-driven, feature-aware joint deep RL model for financial time-series forecasting in algorithmic trading, thus demonstrating the potential of RL in this domain.

Additionally, the evidence suggests that hybrid LSTM methods can outperform other single-supervised ML methods in certain scenarios. Thus, in applying ML to the stock market, researchers have explored the combination of LSTM with different methods to develop hybrid models for improved performance. For instance, Tamura et al. (2018) used LSTM to predict stock prices and reported that the accuracy test results outperformed those of other models, indicating the effectiveness of the hybrid LSTM approach in stock price prediction.

Researchers have explored various hybrid approaches that combine wavelet transforms and LSTM with other techniques to predict stock prices and financial time series. Bao et al. (2017) established a new method for predicting stock prices that integrated wavelet transforms, stacked autoencoders, and LSTM. In the first stage, they decompose the stock price time series to eliminate noise. In the next stage, predictive features for the stock price are created. Finally, LSTM is applied to predict the next day’s closing price based on the features of the previous stage. The authors claimed that their model outperformed state-of-the-art models in terms of predictive accuracy and profitability. To address the non-linearity and non-stationarity of financial time series, Yan and Ouyang (2018) integrated wavelet analysis with LSTM to forecast the daily closing price of the Shanghai Composite Index. Their proposed model outperformed the multilayer perceptron, SVM, and KNN with respect to finding patterns in financial time-series data. Fang et al. (2019) developed a methodology to predict exchange-traded fund (ETF) option prices by integrating LSTM with support vector regression (SVR). They used two LSTM-SVR models to model the final transaction price. In the second LSTM-SVR model, the hidden state vectors of the LSTM and the seven factors affecting the option price were considered as SVR inputs. Their proposed model outperformed other methods, including LSTM and RF, in predicting option prices.

E-commerce

Online shopping, which allows users to purchase products from companies via the Internet, falls under the umbrella of e-commerce. In today’s rapidly evolving online shopping landscape, companies employ effective methods to recognize their buyers’ purchasing patterns, thereby enhancing their overall client experience. Customer reviews play a crucial role in this process as they are not only utilized by companies to improve their products and services but also by customers to assess the quality of a product and make informed purchase decisions (Da et al. 2022). Consequently, the decision-making process is significantly improved through analysis of reviews that provide valuable insights to customers.

Traditionally, enterprises’ e-commerce strategic planning involves assessing the performance of organizational e-commerce adoption behavior at the strategic level. In this context, the decision-making process exhibits typical behavioral characteristics. With regard to organizations’ adoption of technology, it is important to note that the entity adopting the technology is no longer an individual but the organization as a whole. However, technology adoption decisions are still made by people within an organization, and these decisions are influenced by individual cognitive factors (Zha et al. 2021). Individuals involved in the decision-making process have their own perspectives, beliefs, and cognitive biases, which can significantly impact an organization’s technology adoption choices and strategies (Li et al. 2019; Xu et al. 2021). Therefore, the behavioral perspective of technology acceptance provides a new lens for e-commerce strategic planning research. Research on technology acceptance has long been constrained by the limitations of traditional strategic e-commerce planning; with the development of ML, different general models of information technology acceptance behavior are now commonly explored.

Table 5 provides a summary of the aforementioned studies. Cui et al. (2021) constructed an e-commerce product marketing model based on an SVM to improve the marketing effects of e-commerce products. Pang and Zhang (2021) built an SVM model to more effectively solve the decision support problem of e-commerce strategic planning. To increase buyers’ trust in the quality of the products and encourage online purchases, Saravanan and Charanya (2018) designed an algorithm that categorizes products based on several criteria, including reviews and ratings from other users. They proposed a hybrid feature-extraction method using an SVM to classify and separate products based on their features, best product ratings, and positive reviews. Wang et al. (2018a, b, c) employed LSTM to improve the effectiveness and efficiency of mapping customer requirements to design parameters. The results of their model revealed the superior performance of the RNN over the KNN. Xu et al. (2019) designed an advanced credit risk evaluation system for e-commerce platforms to minimize the transaction risks associated with buyers and sellers. To this end, they employed a hybrid ML model combined with a decision tree ANN (DT-ANN) and found that it had high accuracy and outperformed other hybrid ML models, such as logistic regression and dynamic Bayesian network. Cai et al. (2018) used deep RL to develop an algorithm to address the allocation of impression problems on e-commerce websites such as www.taobao.com, www.ebay.com, and www.amazon.com. In this algorithm, buyers are allocated to sellers based on their impressions and strategies to maximize the income of the platform. To do so, they applied a gated recurrent unit, and their findings demonstrated that it outperformed a deep deterministic policy gradient. Wu and Yan (2018) claimed that the main assumption of current production recommender models for e-commerce websites is that all historical user data are recorded. 
In practice, however, many platforms fail to capture such data. Consequently, they devised a list-wise DNN to model the temporal online behavior of users and offered recommendations for anonymous users.

Table 5 Machine learning techniques in e-commerce

Accounting

In the accounting field, ML techniques are employed to detect fraud and estimate accounting indicators. Most companies’ financial statements reflect accounts or disclosure amounts that require estimations. Accounting estimates are pervasive in financial statements and often significantly impact a company’s financial position and operational results. The evolution of financial reporting frameworks has led to the increased use of fair value measurements, which necessitates estimation. Most financial statement items are based on subjective managerial estimates and ML has the potential to provide an independent estimate generator (Kou et al. 2021).

Chen and Shi (2020) utilized bagging and boosting ensemble strategies to develop two models: bagged-proportion support vector machines (pSVM) and boosted-pSVMs. Using datasets from LibSVM, they tested their models and demonstrated that ensemble learning strategies significantly enhanced model performance in bankruptcy prediction. Lin et al. (2019) emphasized the importance of finding the best match between feature selection and classification techniques to improve the prediction performance of bankruptcy prediction models. Their results revealed that using a genetic algorithm as the wrapper-based feature selection method, combined with naïve Bayes and support vector machine classifiers, resulted in remarkable predictive performance. Faris et al. (2019) investigated a combination of resampling (oversampling) techniques and multiple feature selection methods to improve the accuracy of bankruptcy prediction methods. According to their findings, employing the oversampling technique and the AdaBoost ensemble method with a reduced error pruning (REP) tree provided reliable and promising results for bankruptcy prediction.

The earlier studies by Perols (2011) and Perols et al. (2017) were among the first to predict accounting fraud. Two recent studies by Bao et al. (2020) and Bertomeu et al. (2020) used various accounting variables to improve the detection of ongoing irregularities. Bao et al. (2020) employed ensemble learning to develop a fraud-prediction model that demonstrated superior performance compared to the logistic regression and support vector machine models with a financial kernel. Huang et al. (2014) used Bayesian networks to extract textual opinions, and their findings showed that they outperformed dictionary-based approaches, both general and financial. Ding et al. (2020) used insurance companies’ data on loss reserve estimates and realizations and documented that the loss estimates generated by ML were superior to the actual managerial estimates reported in financial statements in four out of the five insurance lines examined.

Many companies commission accounting firms to handle accounting and bookkeeping and provide them access to transaction data, documentation, and other relevant information. Mapping daily financial transactions into accounts is one of the most common accounting tasks. Therefore, Jorgensen and Igel (2021) devised ML systems based on random forest to automate the mapping of financial transfers to the appropriate accounts. Their approach achieved an impressive accuracy of 80.50%, outperforming baseline methods that either excluded transaction text or relied on lexical bag-of-words text representations. The success of such systems indicates the potential of ML to streamline accounting processes and increase the efficiency of financial transaction mapping. Table 6 summarizes the ML techniques described in the “Accounting” section.

Table 6 Machine learning techniques in accounting

Credit risk management

The scoring process is an essential part of the credit risk management system used in financial institutions to predict the risk of loan applications because credit scores imply a certain probability of default. Hence, credit scoring modes have been widely developed and investigated for credit approval assessment of new applicants. This process uses a statistical model that considers both the application and performance data of a credit or loan applicant to estimate the likelihood of default, which is the most significant factor used by lenders to prioritize applicants in decision-making. Given the substantial volume of decisions involved in the consumer lending business, it is necessary to rely on models and algorithms rather than on human discretion (Bao et al. 2019; Husmann et al. 2022; Liu et al. 2019). Furthermore, such algorithmic decisions are based on “hard” information, such as consumer credit file characteristics collected by credit bureau agencies.

Supervised and unsupervised ML methods are widely used for credit risk management. Supervised ML techniques are used in credit scoring models to determine the relationships between customer features and credit default risk and subsequently predict classifications. Unsupervised techniques, mainly clustering algorithms, are used as data mining techniques to group samples into clusters (Wang et al. 2019). Hence, unsupervised learning techniques often complement supervised techniques in credit risk management.

Despite the high accuracy of ML, it is often difficult to explain its predictions. However, financial institutions must maintain transparency in their decision-making processes. Fortunately, researchers have shown that ML can deduce rules to mitigate this lack of transparency without compromising accuracy (Baesens et al. 2003). Table 7 summarizes the recent applications of ML methods in credit risk management. Liu et al. (2022) used KNN, SVM, and random forest to predict the default probability of online loan borrowers and compared their prediction performance with that of a logistic model. Khandani et al. (2010) applied regression trees to construct non-linear, non-parametric forecasting models for consumer credit risk.

Table 7 Machine learning techniques in credit risk management

Cryptocurrency

A cryptocurrency is a digital or virtual currency used to securely exchange and transfer assets. Cryptography is used to securely transfer assets, control and regulate the addition of cryptocurrencies, and secure their transactions (Garcia et al. 2014); hence, the term “cryptocurrency.” In contrast to standard currencies, which depend on the central banking system, cryptocurrencies are founded on the principle of decentralized control (Zhao 2021). Owing to its uncontrolled and untraceable nature, the cryptocurrency market has evolved exponentially over a short period. The growing interest in cryptocurrencies in the fields of economics and finance has drawn the attention of researchers in this domain. However, the applications of cryptocurrencies and associated technologies are not limited to finance. There is a significant body of computer science literature that focuses on the supporting technologies of cryptocurrencies, which can lead to innovative and efficient approaches for handling Bitcoin and other cryptocurrencies, as well as addressing their price volatility and other related technologies (Khedr et al. 2021).

Generating an accurate prediction model for such complex problems is challenging. As a result, cryptocurrency price prediction is still in its nascent stages and further research efforts are required to explore this area. In recent years, ML has become one of the most popular approaches for cryptocurrency price prediction owing to its ability to identify general trends and fluctuations. Table 8 presents a survey of cryptocurrency price prediction research using ML methods. Derbentsev et al. (2019) presented a short-term forecasting model to predict the cryptocurrency prices of Ripples, Bitcoin, and Ethereum using an ML approach. Greaves and Au (2015) applied blockchain data to Bitcoin price predictions and employed various ML techniques, including SVM, ANN, and linear and logistic regression. Among the ML classifiers used, the NN classifier with two hidden layers achieved the highest price accuracy of 55%, followed by logistic regression and SVM. Additionally, the research mentioned an analysis using several tree-based models and KNN.

Table 8 Machine learning techniques in cryptocurrency

The most recent LSTM networks appear to be more suitable and convenient for handling sequential data, such as time series. Lahmiri and Bekiros (2019) were the first to use LSTM to predict digital currency prices, forecasting the three currencies most widely traded at the time of their study: Bitcoin, Ripple, and digital cash. In their study, long memory was used to assess the market efficiency of cryptocurrencies, and the inherent non-linear dynamics encompassing chaoticity and fractality were examined to gauge the predictability of digital currencies. Chowdhury et al. (2020) applied LSTM to the indices and constituents of cryptocurrencies to predict prices. Furthermore, Altan et al. (2019) built a novel hybrid forecasting model based on LSTM to predict digital currency time series.

Energy

The existing applications of ML techniques in energy economics can be classified into two major categories: energy price prediction and energy demand prediction. Energy prices typically demonstrate complex features, such as non-linearity, lag dependence, and non-stationarity, which present challenges for the application of simple traditional models (Chen et al. 2018). Owing to their high flexibility, ML techniques can provide superior prediction performance. In energy demand prediction, lagged values of consumption and socioeconomic and technological variables, such as GDP per capita, population, and technology trends, are typically utilized. Table 9 presents a summary of these studies. A critical distinction between “price” and “consumption” prediction is that the latter is not subject to market efficiency dynamics: predicting consumption has little effect on agents’ actual consumption, whereas price predictions tend to offset themselves by creating opportunities for traders to act on the forecast.
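The feature construction described above (lagged consumption plus socioeconomic variables) can be sketched as follows; the consumption series and GDP figures are hypothetical placeholders:

```python
import numpy as np

# Hypothetical annual energy-consumption series (units arbitrary)
consumption = np.array([320.0, 335.0, 341.0, 360.0, 372.0, 390.0, 401.0])
gdp_per_capita = np.array([10.1, 10.6, 11.0, 11.8, 12.3, 13.0, 13.4])

def make_supervised(y, exog, n_lags=2):
    """Turn a series into (features, target) rows: lagged y plus an exogenous var."""
    X, t = [], []
    for i in range(n_lags, len(y)):
        X.append(list(y[i - n_lags:i]) + [exog[i]])  # y_{t-2}, y_{t-1}, gdp_t
        t.append(y[i])
    return np.array(X), np.array(t)

X, t = make_supervised(consumption, gdp_per_capita)
print(X.shape, t.shape)  # (5, 3) (5,)
```

Each row then feeds any supervised learner (SVM, gradient boosting, a neural network) exactly as in the demand-prediction studies summarized in Table 9.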

Table 9 Machine learning techniques in energy marketing

Predicting prices in energy markets is a complicated process because prices are subject to physical constraints on electricity generation and transmission, as well as to potential market power (Young et al. 2014). Predicting prices using ML techniques is one of the oldest applications in energy economics. In the early 2000s, a wave of studies attempted to forecast electricity prices using conventional ANN techniques. Ding (2018) combined ensemble empirical mode decomposition and an artificial NN to forecast international crude oil prices. Zhang et al. (2020a, b) employed the LSTM method to forecast day-ahead electricity prices in a deregulated electricity market. They also investigated the intricate dependence structure within the price-forecasting model. Peng et al. (2018) applied LSTM with a differential evolution algorithm to predict electricity prices. Lago et al. (2018) first proposed a DNN to improve the predictive accuracy in a local market and then proposed a second model that simultaneously predicts prices from two markets to further improve the forecasting accuracy. Huang and Wang (2018) proposed a model that combines wavelet NNs with random time-effective functions to improve the prediction accuracy of crude oil price fluctuations.

Understanding the future energy demand and consumption is essential for short- and long-term planning. A wide range of users, including government agencies, local development authorities, financial institutions, and trading institutions, are interested in obtaining realistic forecasts of future consumption portfolios (Lei et al. 2020). For demand prediction, Chen et al. (2018) used ridge regression to combine extreme gradient boosting forest and feedforward deep networks to predict the annual household electricity consumption. Wang et al. (2018a, b, c) first built a model using a self-adaptive multi-verse optimizer to optimize the SVM and then employed it to predict China’s primary energy consumption.
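In the spirit of the combination approach of Chen et al. (2018), the following sketch blends two toy base forecasts with closed-form ridge regression. The data and the base learners (a persistence forecast and a linear-trend fit) are simple stand-ins, not the authors' gradient-boosting and deep-network components:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic household electricity consumption with trend and noise
n = 60
y = 200 + 2.5 * np.arange(n) + rng.normal(0, 5, n)

# Two toy base learners standing in for stronger models:
# a persistence forecast (last value) and a linear-trend fit.
persistence = y[:-1]                          # predicts y[t] = y[t-1]
coef = np.polyfit(np.arange(n - 1), y[:-1], 1)
trend_fit = np.polyval(coef, np.arange(1, n))
target = y[1:]

# Stack base predictions and combine them with closed-form ridge regression
Z = np.column_stack([persistence, trend_fit, np.ones(n - 1)])
lam = 1.0                                     # ridge penalty (assumed value)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ target)
combined = Z @ w
rmse = np.sqrt(np.mean((combined - target) ** 2))
print(round(float(rmse), 2))
```

The ridge layer learns how much weight each base forecast deserves, which is the essence of using a linear meta-learner to combine heterogeneous predictors.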

Critical discussions and future research directions

ML techniques have proven valuable for establishing computational models that capture complex relationships in the available data. Consequently, ML has become a useful tool in business and finance. This section critically discusses the existing research and outlines future directions.

Critical discussions

Although ML techniques are widely employed in business and finance, several issues need to be addressed.

  1.

    Linguistic information is abundant in business and finance, encompassing online commodity comments and investors’ emotional responses in the stock market. Nonetheless, existing research has predominantly concentrated on processing numerical data. Compared with numerical information, linguistic data harbor intricate characteristics, notably personalized individual semantics (Li et al. 2022a, b; Zhang et al. 2021a, b; Hoang and Wiegratz 2022).

  2.

    The integration of ML into business and finance can lead to interpretability issues. In ML, an interpretable model is one in which a human observer can readily comprehend how the model transforms an observation into a prediction (Freitas 2014). Typically, decision-makers are hesitant to accept recommendations generated by ML techniques unless they can grasp the reasoning behind them. Unfortunately, existing research in business and finance, particularly studies employing DNNs, has seldom emphasized the interpretability of its models.

  3.

    Social networks are prevalent in the marketing domain within businesses (Zha et al. 2020). For instance, social networks exist among consumers, whose purchasing behavior is influenced by the opinions of trusted peers or friends. However, existing research applying ML to marketing has predominantly concentrated on personal customer attributes, such as personality, purchasing power, and preferences (Dong et al. 2021). Regrettably, the influence of social networks on customer behavior has been largely overlooked in these studies.

  4.

    ML techniques typically focus on exploring the statistical relationships between dependent and independent variables and emphasize feature correlations. However, in the context of business and finance applications, causal relationships exist between variables. For instance, consider a study suggesting that girls who have breakfast tend to have lower weights than those who do not, based on which one might conclude that having breakfast aids in weight loss. However, in reality, these two events may only exhibit a correlation rather than causation (Yao et al. 2021). Causality plays a significant role in ML techniques’ performance. However, many current business and finance applications have failed to account for this crucial factor. Ignoring causality may lead to misleading conclusions and hinder accurate modeling of real-world scenarios. Therefore, incorporating causality into ML methodologies within the business and finance domains is essential for enhancing the reliability and validity of predictive models and decision-making processes.

  5.

    In the emerging cryptocurrency field, traditional statistical methods are simple to implement and interpret, but they rely on many unrealistic statistical assumptions, which makes ML an attractive alternative. Nevertheless, accurately predicting cryptocurrency prices remains challenging, and most ML techniques in this area require further investigation.

  6.

    In recent years, rapid growth in digital payments has led to significant shifts in fraud and financial crimes (Canhoto 2021; Prusti et al. 2022; Wang et al. 2023). While some studies have shown the effective use of ML in detecting financial crimes, the research dedicated to this area remains limited. As highlighted by Pourhabibi et al. (2020), the complex nature of financial crime detection applications poses challenges in terms of deploying solutions and achieving the desired detection performance levels. These challenges are manifested in two primary aspects. First, ML solutions face substantial pressure to deliver real-time responses owing to the constraints of processing data in real time. Second, in addition to inherent data noise, criminals often attempt to introduce deceptive data to obfuscate illicit activities (Pitropakis et al. 2019). Regrettably, few studies have investigated the robustness and performance of the underlying algorithmic solutions when confronted with data quality issues.

  7.

    In the finance domain, an important limitation of the current literature on energy and ML is that most works emphasize the computer science perspective of optimizing computational metrics (e.g., the accuracy rate), while financial intuition may be ignored.
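The correlation-versus-causation issue raised in point 4 above (the breakfast example) can be illustrated with a small, entirely hypothetical simulation in which a confounder drives both variables and there is no direct causal effect at all:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# A confounder (e.g., overall lifestyle) drives both variables;
# "breakfast" has zero direct effect on "weight" in this simulation.
lifestyle = rng.normal(0, 1, n)
breakfast = (lifestyle + rng.normal(0, 1, n) > 0).astype(float)
weight = 70 - 3 * lifestyle + rng.normal(0, 2, n)

# Naive group comparison: breakfast eaters weigh less, purely via the confounder
gap = weight[breakfast == 1].mean() - weight[breakfast == 0].mean()
print(round(float(gap), 1))  # clearly negative, despite zero causal effect
```

A model trained on such data would confidently "predict" weight from breakfast habits, which is precisely why causal reasoning matters before drawing conclusions from correlational fits.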

Future research directions

Thus, we propose that future research on this topic follow the directions below:

  1.

    As analyzed above, abundant linguistic information exists in business and finance. Consequently, leveraging natural language processing technology to handle and analyze linguistic data in these domains represents a highly promising research direction.

  2.

    The combination of theoretical models with ML techniques is an important research topic. Incorporating interpretable models can help open the black box of ML-driven analyses, thereby elucidating the underlying reasoning behind the results. Consequently, introducing interpretable models into business and finance when applying ML can yield substantial benefits.

  3.

    Consumer interactions and behaviors are often intertwined within social networks, making it crucial to incorporate social network dynamics when modeling their influence on consumer behavior. Introducing the social network aspect into ML models has tremendous potential for enhancing marketing strategies and outcomes (Trandafili and Biba 2013).

  4.

    Causality has garnered increasing attention in the field of ML in recent years. Accordingly, we believe it is an intriguing avenue to explore when applying ML to address problems in business and finance.

  5.

    Further studies need to include all relevant factors affecting market mood and track them over a longer period to understand the anomalous behavior of cryptocurrencies and their prices. We recommend that researchers analyze the use of LSTM variants in future research, such as CNN-LSTM and encoder–decoder LSTM, and compare the results to obtain further insights and improve price prediction. In addition, researchers can apply sentiment analysis to collect social signals, which can be further enhanced by improving the quality of content and using more content sources. Another area of opportunity is the use of more specialized models with different types of approaches.

  6.

    Graph NNs and emerging adaptive solutions provide important opportunities for shaping the future of fraud and financial crime detection owing to their parallel structures. Because of the complexity of digital transaction processing and the ever-changing nature of fraud, robustness should be treated as the primary design goal when applying ML to detect financial crimes. Finally, focusing on real-time responses and data noise issues is necessary to improve the performance of current ML solutions for financial crime detection.

  7.

    Currently, the application of unsupervised learning methods in different areas, such as marketing and risk management, is limited. Some problems related to marketing and customer management could be analyzed using clustering techniques, such as K-means, to segment clients by different demographic or behavioral characteristics and by their likelihood of default or switching companies. In energy risk management, extreme events can be identified as outliers using principal component analysis or ranking algorithms.
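A minimal K-means segmentation of the kind suggested above might look as follows. The customer features (annual spend and purchase frequency) are hypothetical, and the clustering is a plain numpy implementation of Lloyd's algorithm rather than a library call:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical customers: [annual spend (k$), purchases per year]
low = rng.normal([2, 5], [0.5, 1.5], (50, 2))
high = rng.normal([12, 30], [1.5, 4.0], (50, 2))
customers = np.vstack([low, high])

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, recompute means."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

labels, centroids = kmeans(customers, k=2)
print(np.bincount(labels))  # segment sizes
```

In practice, features would be standardized first and the number of segments chosen with a criterion such as the elbow method or silhouette scores.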

Conclusions

ML techniques have already made notable contributions to business and finance, and their use for addressing issues in these domains is increasing significantly. This review discusses advancements in ML in business and finance by examining seven research directions: cryptocurrency, marketing, e-commerce, energy marketing, the stock market, accounting, and credit risk management. Models such as DNNs, CNNs, RNNs, random forests, and SVMs are highlighted in almost every domain of business and finance. Finally, we analyze some limitations of existing studies and suggest several avenues for future research. This review helps researchers understand the progress of ML applications in business and finance, thereby promoting further developments in these fields.