1 Introduction

The concepts of artificial intelligence (AI) and its subset, machine learning (ML), trace their roots back to Turing (1950) and Samuel (1959) respectively. Turing introduced the concept through the “imitation game”, a test of a machine's ability to simulate human-like behavior. Samuel, in turn, defined ML as “the field of study that gives computers the ability to learn without being explicitly programmed”. However, despite its importance, ML lacks a universally accepted definition, leaving a gap in the research (Gu et al. 2020). This ambiguity raises questions about the delineation between ML, broader AI, and purely statistical models. Additionally, the classification of a novel algorithm or methodology as part of the ML domain is left to the discretion of the author. Our research adopts Samuel’s (1959) definition, including statistical regression models within the domain of ML while excluding the “explicitly programmed” methods associated with AI. In this study, we utilize bibliometric analysis to provide insights into the application of ML within the accounting and finance (A&F) discipline. We aim to explore the development of academic research in this area, shed light on its current focus, and identify potential future avenues of study.

Frequently, scholars encounter a dilemma when selecting between statistical methods and ML approaches. Statistical methods often provide clear interpretations of results, allowing for a better understanding of the relationships between variables. Moreover, statistical methods are often based on well-defined assumptions, which helps in understanding the limitations of the model, and they are useful for making inferences about populations based on sample data, especially in hypothesis testing and confidence interval estimation. Lastly, they perform well with smaller datasets and limited computational resources. However, if the assumptions underlying a statistical model are violated, the results can be inaccurate or misleading. In addition, some statistical methods struggle to handle complex relationships or high-dimensional data, which limits their predictive accuracy. Finally, they are less flexible in adapting to various types of data structures or patterns.

On the other hand, ML algorithms can often handle complex patterns and large datasets, resulting in better predictive accuracy. They are characterized by flexibility, can adapt to different data structures, and are often more versatile in handling various types of data, including unstructured data such as images and text. In addition, many ML algorithms automatically learn relevant features from the data. Finally, ML algorithms usually scale efficiently with big data by leveraging parallel processing and distributed computing. However, one of the major critiques of ML models is that they can be challenging to interpret, leading to a lack of transparency in how a model arrives at its decisions. In addition, these models may suffer from overfitting, especially when model complexity is high and data is limited. Moreover, ML algorithms are computationally expensive, and larger amounts of data are usually needed compared to statistical methods, further increasing training times. In practice, the choice between statistical methods and ML often depends on the specific problem, the available data, the desired level of interpretability, and the trade-offs between predictive accuracy and understanding the underlying relationships in the data.

ML algorithms are commonly separated into three main categories based on the training methodology: supervised, unsupervised, and reinforcement learning. Supervised models require labeled datasets in which training data pairs (x, y) are provided during the training phase. The dependent variable y is used to enhance performance by iteratively adjusting the model parameters (see Mohri et al. 2012; Vapnik 2000). For example, in plant identification, images of plants along with their corresponding plant names are presented to the ML model. This type of model is used in both classification and regression problems. On the other hand, unsupervised models identify patterns and hidden features without labeled data. Clustering, association, and dimensionality reduction are common problems addressed through these models. Finally, reinforcement learning algorithms autonomously learn from their environment through trial and error. Data scientists provide reward and penalty rules that the algorithm's agent uses to distinguish correct from wrong actions. This type of model is commonly used in the gaming industry and in robot navigation applications (Russell and Norvig 2020).
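The distinction between the first two categories can be made concrete with a minimal sketch; the dataset and models below are arbitrary illustrative choices, not ones drawn from the reviewed literature.

```python
# Illustrative sketch: supervised vs. unsupervised learning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labeled pairs (x, y) guide the iterative parameter adjustment.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: patterns are found without labels (here, clustering).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters[:10])
```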

Recent ML applications in A&F span a plethora of traditional subjects, inter alia fraud identification (e.g., Achakzai and Peng 2023; Debener et al. 2023), sentiment extraction from financial corpora (e.g., Blankespoor et al. 2023; Huang et al. 2023), portfolio optimization (e.g., Kaniel et al. 2023; Wang et al. 2024), and bankruptcy prediction (e.g., Cao et al. 2022; Nguyen et al. 2023). Moreover, ML utilization also incorporates the latest trends related to cryptocurrency markets (e.g., Cohen 2023; Han et al. 2024), environmental implications (e.g., Frost et al. 2023; Sautner et al. 2023), and the Covid-19 impact (e.g., Chortareas et al. 2024; Yang et al. 2024), indicating continuous interest in its capabilities.

The above discussion suggests several reasons for undertaking this study. Firstly, empirical studies on ML in A&F have increased significantly in recent years and are widely scattered across various academic journals, making it challenging to obtain a clear picture of this expanding research area. Secondly, to our knowledge, no other study presents a comprehensive literature review on ML in A&F. While a limited number of review studies address this issue to some extent from either an AI or an ML perspective (Gray et al. 2014; Sutton et al. 2016; El-Haj et al. 2019; Weigand 2019; Karolyi and Van Nieuwerburgh 2020; Ahmed et al. 2022; Han et al. 2023), their focus is considerably narrower than the exhaustive coverage offered in this review. For instance, Han et al.’s (2023) review focuses on blockchain applications in accounting, whereas El-Haj et al. (2019) examine the utilization of Computational Linguistics, a subset of AI, in financial disclosure. Thirdly, this review responds to recent calls for more research on machine learning within A&F.

To address the research gap discussed above, we employ a combination of quantitative techniques, such as bibliometric analysis, and a critical review of all identified research foci within the literature corpus. This approach enables us to offer a comprehensive and interdisciplinary synthesis of knowledge in this field. Specifically, we aim to address three key research questions in this stream of research: RQ1 How has research on the impact of ML on A&F developed? RQ2 What is the focus within this corpus of literature? RQ3 What are the future avenues of ML in A&F research? We adopt a critical approach to the research foci identified in the corpus of literature by analyzing 575 papers from 93 established quality journals.

Our results reveal increased interest in this field since 2015, with the majority of studies focused either on the US market or on a global scale. Publications related to Asian markets gained momentum, increasing by 950% during 2020–2022. Further, our analysis shows that supervised models are by far the most frequently applied, in contrast to unsupervised models, which are mainly focused either on topic extraction through the Latent Dirichlet Allocation (LDA) algorithm or on clustering. Additionally, through comprehensive bibliographic analysis, our study identifies six distinct clusters. For each cluster we present the key topics, examine the current challenges, and discuss the various prospective opportunities. We also examine and propose future avenues of research on ML in A&F. Finally, we analytically present and discuss the various limitations of ML and possible directions for future research to overcome them.

Our research contributes to the relevant literature in several ways. Firstly, to the best of our knowledge, this is the first literature review that investigates both A&F research streams exclusively through the prism of ML. Secondly, our analysis focuses on established journals included in the 2021 Academic Journal Guide (AJG) in the field of A&F (ranked as 4*, 4, 3, 2 and 1) to ensure that our findings are derived from high-quality academic research and to identify the direction of future research. Thirdly, for each cluster we summarize the research topics, as well as the preferred methodologies and the best-performing models for each topic. This summary intends to offer valuable guidance to scholars and early-career researchers interested in employing ML in the fields of A&F.

The paper is organized as follows. In Sect. 2 we introduce our methodology. Section 3 discusses the results regarding the three interrelated research questions and adopts a critical approach to the research foci identified in the corpus of literature. In Sect. 4 we present opportunities and challenges of ML. Finally, Sect. 5 outlines the main conclusions and presents the limitations of the paper.

2 Methodology

In this study, we utilize bibliometric analysis to provide insights into the application of ML within the A&F discipline. We aim to explore the development of academic research in this area, shed light on its current focus, and identify potential future avenues of study. Bibliometric analysis serves the dual purpose of mitigating author bias (MacCoun 1998) and efficiently summarizing extensive datasets (Broadus 1987). Given the nascent stage of ML in A&F, we combine quantitative with qualitative data by complementing bibliometric analysis with a literature review for a deeper understanding of our topic (Rialti et al. 2019). This combination of techniques differentiates our study from previous literature research in Artificial Intelligence and ML (Goodell et al. 2021; Ahmed et al. 2022; Ranta et al. 2022) by shedding light on traditional methods, their current applications, and comparisons with ML models for A&F tasks.

Firstly, we conduct our initial search in August 2023 exclusively in the Web of Science (WoS) database, as it is considered the most credible, transparent, and reliable source of information (Modak et al. 2019; Levine-Clark and Gil 2021). In our search keyword selection process, we begin by identifying literature papers that are both pertinent to our study and published in high-quality journals rated as 4* (internationally recognized as examples of excellence), 4 (top journals in the field), and 3 (highly regarded journals), following the guidelines established by the Chartered Association of Business Schools through the Academic Journal Guide (AJG). Our analysis aims to reveal keywords from titles, author-provided keywords, and publication abstracts that comprehensively address our research topic. In particular, we adopt the ML keywords outlined in the review conducted by Ghoddusi et al. (2019), adjusting our query to align with the objectives of our study. Therefore, our keyword list is composed of 25 keywords related to ML and 13 related to A&F. To further expand our search, we also apply stemming using the asterisk (“*”) wildcard, supported by the WoS database, as a suffix for generic keywords (e.g., Auditing becomes Audit*). The final query can be found in the Appendix.
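As an illustration of this query-building step, the sketch below assembles a WoS-style topic (TS=) query from a hypothetical subset of keywords; the actual 25 ML and 13 A&F keywords are those reported in the Appendix.

```python
# Illustrative sketch only: the keyword lists below are a hypothetical
# subset, not the final query reported in the Appendix.
ml_terms = ['"machine learning"', '"neural network*"', '"random forest*"',
            '"support vector*"']
af_terms = ['"asset pricing"', 'audit*', 'bankruptc*', '"stock market"']

query = "TS=({}) AND TS=({})".format(" OR ".join(ml_terms),
                                     " OR ".join(af_terms))
print(query)
# TS=("machine learning" OR ...) AND TS=("asset pricing" OR audit* OR ...)
```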

As a next step, we refine our search results by retrieving articles published up to and including 2022. Additionally, we restrict our search to peer-reviewed and scholarly journals included in the 2021 AJG guide in the field of A&F (ranked as 4*, 4, 3, 2 and 1) to ensure a minimum level of research quality (Chartered Association of Business Schools 2021; Harvey et al. 2010). As a consequence, the initial dataset comprises 3,709 articles, of which 1,004 align with our rating criteria. After eliminating articles that are not pertinent to our research and those that are irretrievable, our final corpus consists of 575 articles.

To answer our three research questions, we employ quantitative and qualitative tools to conduct our analysis. Specifically, for RQ1, we perform preliminary descriptive analysis using the Bibliometrix R software (Aria and Cuccurullo 2017) to gain further insights into our corpus. Furthermore, we examine the evolution and contribution of journals in academic research, and we present the geographic focus trends.

To address our RQ2, we carry out bibliographic coupling, a method known for its capacity to produce highly accurate clustering results (Boyack and Klavans 2010). This method hinges on the assumption that publications delving into similar topics will share common citations; the greater the number of shared citations, the stronger the connection between the articles, and the higher the likelihood that they address a shared topic. Hence, through bibliographic coupling, we can cluster our results and conduct a comprehensive literature review to uncover the key topics and considerations addressed in these clusters. We have selected VOSviewer (Van Eck and Waltman 2010), a tool widely employed by scholars for literature review and known for its graphical representation of clustering (e.g., Ciampi et al. 2021). Furthermore, to enhance our understanding of cluster topics, we construct wordclouds through bag-of-words, a standard technique exercised in literature analysis (Baker et al. 2021).
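A minimal sketch of the underlying computation, with hypothetical reference lists, is the following:

```python
# Bibliographic coupling sketch: the coupling strength of two papers is the
# number of references they share. Reference lists here are hypothetical.
from itertools import combinations

references = {
    "paper_A": {"ref1", "ref2", "ref3"},
    "paper_B": {"ref2", "ref3", "ref4"},
    "paper_C": {"ref5"},
}

for p, q in combinations(references, 2):
    strength = len(references[p] & references[q])
    print(f"{p} <-> {q}: coupling strength = {strength}")
# Pairs with higher strength are more likely to address a shared topic
# and therefore end up in the same cluster.
```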

Lastly, to identify potential future avenues for our RQ3, we employ co-word analysis on author-specified keywords through the VOSviewer tool. Each keyword is thus mapped over time, unfolding trends and patterns. Moreover, our literature review identifies key considerations and limitations of ML that could be put under the microscope in future research.

3 Results

3.1 (RQ1) How has research on the impact of ML on A&F developed?

To address our initial research question, we undertake a quantitative analysis of 575 publications to offer an overview of current research. In particular, we identify trends in the number of publications per year while pinpointing the top journals within our topic. Moreover, we perform a thorough literature review to recognize the geographic data sources and trends of the selected publications. Finally, we identify the most impactful countries based on publications and citations.

Firstly, the descriptive statistics of the selected publications, produced with Bibliometrix R (Aria and Cuccurullo 2017), are included in Table 1. Specifically, this table is composed of four panels, namely “Main Data Information”, “Document Contents”, “Authors”, and “Document Types”, each providing a different view of our corpus. With an annual growth rate of 15.92% and an average document age of 3.27 years, ML research in A&F is a trending topic in its infancy. Additionally, a higher co-authorship average in comparison to previous literature analyses of A&F publications by Gaunt (2014) and Korkeamaki et al. (2018) may indicate the added complexity of applying machine learning in these disciplines. The “Document Types” panel illustrates the distribution of publications per document type, including early access articles to be published after 2022, an indication of continued momentum in this topic.

Table 1 Descriptive statistics

The paper by Gray (1996) is the oldest article in our corpus, published in the “Journal of Financial Economics”. Consistent with prior research, our analysis reveals that interest in ML applications within the A&F domain started to surge in 2015, a trend that remains robust due to technological advancements enabling the execution of these algorithms on conventional personal computers. These findings are presented in Fig. 1. Specifically, we categorize journals with ten or fewer publications as "Other" and exclude 53 unpublished articles from our analysis. Among the journals, "Quantitative Finance" stands out by contributing 16.35% of the papers in our corpus, while "Finance Research Letters" published the highest number of articles in 2022, totaling 30. Noticeably, 88.38% of the publications in our corpus relate to the finance discipline. This can be attributed to several factors. In practice, accounting captures and provides information (the dual role of accounting: the valuation and contracting perspectives), while finance uses this information to make informed decisions (Baker and Wurgler 2002; Ruch and Taylor 2015). Despite the fact that A&F often co-exist in a single academic unit (Smith and Urquhart 2018), the skills and expertise required for each field differ accordingly. While ML is used in both fields, it appears more prevalent in finance due to its wider range of applications. ML is ideal for handling the vast amount of complex data available in financial markets, enabling real-time decision-making and predictive analysis. On the other hand, accounting frequently entails nuanced tasks, such as providing tax advice or handling matters that require more intuition, where ML may not be as effective. Also, most ML models lack transparency, making ML adoption more difficult for practitioners who often must clearly justify their decision-making process. Finally, the gap may simply reflect a lack of relevant expertise. However, over the past decade, there has been a significant increase in publications within the field of accounting that incorporate ML algorithms. Especially following the publication of Loughran and McDonald's paper in 2011, there has been an exponential rise in the utilization of text analysis applications in both A&F.

Fig. 1 Number of papers per journal and published year

In Table 2, we provide a list of the top 10 journals from a total of 93, based on several key criteria: the number of articles, total citations, average citations per article, h-index, g-index, and journal impact factor as provided by Clarivate. The journals are ranked on these six criteria so that each journal outperforms the one ranked immediately below it on at least four of them. In the event of a tie, we consider the AJG ranking as the determinant of the journal's impact.

Table 2 Top journals in ML in A&F

Observing, through the literature review, a considerable concentration of studies within a particular geographic region, it becomes imperative to replicate these experiments in alternative markets or on a global scale. This approach serves two critical purposes: validating results and examining potential variations across different geographic locations. Hence, we implement a geographic data filter, assessing the specific data utilized in each publication. In some cases, we identify datasets encompassing multiple countries, whether from the same continent or across different continents. In the first case we classify those papers under the related continent, while in the latter we label them as “Global”. Our results can be found in Table 3, which presents up to the top five regions per continent. Literature that either lacks explicit mention of the dataset's composition or does not utilize any dataset has been excluded from this analysis.

Table 3 ML application in continents and countries

While the majority of publications are tailored to the US markets, almost 30% of papers are focused on global datasets. We further examine this trend in Fig. 2, depicting the evolution of the scientific focus over the years. Please note that early access papers published after 2022 have been removed from our analysis.

Fig. 2 Geographic dataset trends

With few exceptions, a significant number of ML applications on North American datasets has been observed since 1996, constituting 38.26% of our dataset population. While there has been a positive upward trend since 2016, the proportion attributed to this group stabilized in 2022, despite the ongoing increase in total publications. Specifically, publications on Asian markets increased by 975%, European by 243%, and Global by 95.83% over the last 3 years. During the same timeframe, we notice only one publication for Oceania and one publication per year for Africa. In summary, although, as expected, a considerable number of papers still rely on North American data in absolute terms, there is a noticeable increase in applications across the globe, particularly on Asian datasets, reflecting a growing trend.

Lastly, we perform bibliographic analysis through VOSviewer (VOSviewer—Visualizing scientific landscapes 2021) to identify the most influential countries by average citations per paper in our corpus. Countries with five or more published papers are presented in Fig. 3. Papers from Austria and the United States are the most cited (with 86 and 33.31 average citations respectively), in contrast to Japan and Thailand, which sit at the other end of this spectrum (2.58 and 4.6 respectively).

Fig. 3 Bibliographic coupling countries

Stepping further into our analysis, we identify the most influential countries in terms of publications by measuring total publications, total citations, and average citations per publication. In particular, the United States of America published 27.48% of the papers (N = 158), achieving 5,295 citations, followed by China, which is credited with 1,132 citations from 102 publications. Countries cited more than 200 times are illustrated in Fig. 4. Table 4 presents the top 10 countries in relation to h-index, g-index, and total citations. The countries are ranked on these three criteria so that each country outperforms the one ranked immediately below it on at least two of them.

Fig. 4 Most influential countries (> 200 times cited)

Table 4 Most influential countries

3.2 (RQ2) What is the focus within this corpus of literature?

3.2.1 Bibliographic coupling results

In this section, we provide an in-depth analysis of our corpus of literature by initially performing clustering through bibliographic coupling and combining bag-of-words analysis with a literature review to identify the key topics for each cluster. Moreover, we extract key considerations and proposed ML models to assist future research. Lastly, we examine the cluster distribution over time to indicate the academic focus and its progression.

Bibliographic coupling analysis reveals six clusters by linking the citing documents based on the number of papers cited together. VOSviewer (VOSviewer—Visualizing scientific landscapes 2021) constructs distance-based maps in which the smaller the distance between two items, the stronger their relation (Van Eck and Waltman 2010). In Fig. 5 we present the resulting map, where each node represents a paper and its color indicates the assigned cluster. The node size indicates the number of times a paper is cited. The produced map enhances our motivation to review all papers in order to identify the characteristics of each cluster; common citations between groups and overlapping districts indicate a somewhat strong correlation between some of the papers in different clusters. Natural Language Processing (NLP) techniques, namely bag-of-words and n-grams, are employed on the keywords, titles, and abstracts of publications, solidifying the cluster results. Specifically, for each cluster, we consolidated the information collected from the keywords, titles, and abstracts of the included publications and applied a stemming process. Finally, we calculated the occurrence of each word or n-gram within the cluster to construct our wordclouds. Words that are common across all clusters, such as “machine” and “learning”, are excluded from this analysis. The full corpus is available as online supplementary material.
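A simplified sketch of this wordcloud pipeline, on hypothetical texts, could look as follows:

```python
# Sketch of the wordcloud pipeline described above: pool texts per cluster,
# stem, count uni-/bi-grams, and drop corpus-wide terms such as "machine"
# and "learning" (here in their stemmed forms). The texts are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
docs = ["Machine learning for stock market volatility forecasting",
        "Forecasting exchange rate time series with neural networks"]
stemmed = [" ".join(stemmer.stem(w) for w in d.lower().split()) for d in docs]

vec = CountVectorizer(ngram_range=(1, 2), stop_words=["machin", "learn"])
counts = vec.fit_transform(stemmed).sum(axis=0).A1
freq = dict(zip(vec.get_feature_names_out(), counts))
print(sorted(freq.items(), key=lambda kv: -kv[1])[:5])
```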

Fig. 5 Bibliographic coupling map

3.2.2 Red cluster: markets and time-series forecasts

The first cluster consists of 172 publications with a total of 4,149 citations. Our analysis reveals that the most common bi-grams in this cluster are “Neural Network” (64 occurrences), “Time Series” (61 occurrences), “Stock Market” (50 occurrences), “Covid-19” (50 occurrences) and “Exchange Rate” (33 occurrences). As presented in Fig. 6 panel A, this extensive cluster encompasses topics related to market volatility, portfolio creation, trading behavior, cryptocurrency, and forex markets, all of which are associated with “time series” analysis. The dynamic, noisy, and non-linear nature of financial time series forecasting makes it a complex endeavor (Karathanasopoulos et al. 2015).

Fig. 6 Wordclouds of the six clusters. Note: The font size indicates the number of times each word occurs within the cluster

Starting with the volatility subcluster, the primary goal is to minimize tolerated risk and maximize gains. Traditional GARCH and Stochastic Volatility models are not suitable in the current high-frequency data environment (Liu et al. 2018). Furthermore, the GARCH model is prone to explosive conditional variance, which has implications for volatility forecasting (Gray 1996). Moreover, the Markowitz mean–variance portfolio model (Markowitz 1952) ignores transaction costs and is more prone to estimation error than minimum-variance models (Clarke et al. 2011). Thus, the topic of discussion in this subcluster is the comparison of econometric models with ML models that are able to process multi-dimensional and non-linear data. For example, the results of Hu and Tsoukalas (1999) indicate that realized volatility approximates stock volatility through a non-linear approach in which neural networks outperform the GARCH model. The choice of forecasting model depends on the time frame, with long-term volatility forecasts favoring ML models over the econometric GARCH for Forex and the Chinese CSI 300 index, while errors remain identical for short-term forecasts (Zhai et al. 2020). Combining Neural Networks with MC-GARCH for high-frequency data processing is another option that shows promising results, achieving higher accuracy than standalone MC-GARCH. Similarly, the Heterogeneous Autoregressive model (HAR) yields exceptional results in high-frequency oil price prediction (Gkillas et al. 2020), but underperforms Neural Networks and tree-based ML algorithms in stock volatility forecasting (Christensen et al. 2022). Lastly, the combination of ML models for volatility forecasting is also under the microscope. For example, Qiu et al. (2020) propose combining the HAR model with a random forest for forecasting the price volatility of 100 Exchange Traded Funds (ETFs).
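To make the flavor of such hybrid approaches concrete, the sketch below, loosely in the spirit of Qiu et al. (2020), feeds the standard HAR predictors (daily, weekly, and monthly realized-volatility averages) to a random forest instead of a linear HAR regression; the volatility series is simulated for illustration.

```python
# Hedged sketch: HAR-style features with a random forest regressor.
# The realized-volatility series is simulated, not real market data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
rv = np.abs(rng.normal(0.01, 0.005, 1000))  # simulated daily realized vol

X, y = [], []
for t in range(22, len(rv) - 1):
    X.append([rv[t],                 # daily component
              rv[t - 5:t].mean(),    # weekly component
              rv[t - 22:t].mean()])  # monthly component
    y.append(rv[t + 1])              # next-day target

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:-100], y[:-100])        # train on the earlier observations
print("out-of-sample R^2:", model.score(X[-100:], y[-100:]))
```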

Portfolio creation is the second subcluster of this group. Traditional approximations of optimal asset allocation, assessing the risk/benefit ratio, include the Black-Litterman (Black and Litterman 1992) and the Markowitz (Markowitz 1952) models. The first model suffers from high transaction costs and does not account for stock-specific views, while the latter is sensitive to assumptions. Rebalancing models and the naïve 1/N rule, which involves investing equally in N assets, often perform better than the Markowitz model (Mulvey et al. 2001; DeMiguel et al. 2007). Thus, creating portfolios of assets that are as independent as possible and rebalancing only when a certain risk threshold is exceeded is a promising alternative (Liu et al. 2015; Li et al. 2016). Supervised, unsupervised, and reinforcement models are proposed as replacements for, or enhancements of, the econometric ones. Pyo and Lee (2018) recognize low-risk anomalies and the outperformance of low-risk portfolios compared to high-risk ones. They experiment with ML algorithms and the GARCH model to forecast volatility initially; subsequently, they integrate these forecasts with the Black-Litterman model for portfolio construction purposes. The combination of the Markowitz model with Neural Networks (Bradrania et al. 2021), or with Convolutional Neural Networks (CNN) and reinforcement learning (Aboussalah et al. 2021), seems to exceed expectations. In contrast, the poor performance of most reinforcement learning algorithms is attributed to noisy and non-stationary financial environments (Aboussalah et al. 2021).
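For concreteness, the sketch below contrasts the naïve 1/N weights with the unconstrained minimum-variance solution on simulated returns; it illustrates the benchmark models discussed above rather than any specific study.

```python
# Sketch: 1/N rule vs. minimum-variance weights w = inv(S)1 / (1'inv(S)1),
# where S is the sample covariance matrix. Returns are simulated.
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(0.001, 0.02, size=(500, 4))  # 500 days, 4 assets

cov = np.cov(returns, rowvar=False)
ones = np.ones(cov.shape[0])

w_naive = ones / len(ones)            # equal investment in each asset
w_minvar = np.linalg.solve(cov, ones)
w_minvar /= w_minvar.sum()            # normalized min-variance weights

for name, w in [("1/N", w_naive), ("min-variance", w_minvar)]:
    vol = np.sqrt(w @ cov @ w)        # portfolio standard deviation
    print(f"{name}: weights={np.round(w, 3)}, daily vol={vol:.4f}")
```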

Identification of key aspects of trading behavior, strategies, and price forecasting is the topic of the third subcluster. Starting with the identification of informed trading, current models are computationally intensive and sensitive to the choice of initial parameters, requiring up to months of effort (Gan et al. 2015). The unsupervised Hierarchical Agglomerative clustering model is a better alternative in terms of speed and accuracy (Gan et al. 2015; Lin et al. 2021). In the pursuit of automated trading, Genetic Algorithms are able to learn trading strategies and apply them accordingly; however, after the inclusion of transaction costs, their returns are similar to a buy-and-hold strategy (Allen and Karjalainen 1999). One solution is to import risk factors into the model and apply boosting algorithms that avoid unnecessary costly trades by relying exclusively on “strong” signals for decision making (Creamer and Freund 2010). In financial prediction, current econometric and statistical models such as ARMA, ARIMA, VAR, and GARCH perform well with linear data but struggle when this assumption is violated (Wu et al. 2019). The comparison of ARMA models with Support Vector Machines (SVM) for index prediction reinforces this statement (Karathanasopoulos et al. 2015). We also find that Neural Networks have been used successfully for stock price prediction (Chen and Ge 2019; Zhang et al. 2021); however, Genetic Programming could be a better alternative (Dunis et al. 2013).

Cryptocurrencies provide an intriguing avenue for investigating market efficiency in high-frequency trading by training ML algorithms to simulate investor actions (Manahov and Urquhart 2021) or by predicting prices on a daily timeframe incorporating significant macroeconomic variables (Liu et al. 2021). Unsupervised models can also be employed, through clustering, to examine market efficiency and behavior by identifying bubbles (El Montasser et al. 2022). Moreover, the realm of price prediction remains a central topic in academic research. Aggarwal et al. (2020) show that the accuracy of the SVM model fluctuates based on the selected forecasting period: notably, forecasting the bitcoin price for the fifth day in the future yields more accurate results than predicting the 15th or 30th day. Lastly, ML models and technical analysis play a role in identifying trends in cryptocurrency prices. However, the absence of observed abnormal returns suggests that cryptocurrency markets may be more analogous to traditional financial markets (Anghel 2021).

The fourth subcluster comprises nine publications that assess the impact of the Covid-19 pandemic on financial markets and aim to identify the key features that minimize risk. Specifically, social media and Covid-related news improve market impact predictability, with varying impacts across Gulf Cooperation Council countries (Al-Maadid et al. 2022). Through Hierarchical clustering, the interconnectedness of markets is examined both before and after the financial crisis: tightly connected countries tend to strengthen their interconnectedness, in contrast to less strongly connected ones, which tend to move closer to another cluster after the crisis (León et al. 2017). At an industry level, examining the correlation between positive and negative news related to Covid-19 and US stock prices can serve as an indicator of systemic risk (Baek et al. 2020). Zaremba et al. (2021) investigate 67 markets using multiple factors and observe that countries with low unemployment rates, conservative investment policies, and undervalued companies tend to be more protected from global pandemics. Lastly, when comparing simple linear regression with ML regression models, Support Vector Regression (SVR) and Random Forest achieved better accuracy in correlating the Covid-19 death rate with stock market performance in India (Behera et al. 2022).

In summary, this cluster identifies the importance of ML for time-series forecasting, as traditional models are more suitable for linear data. However, a combination of econometric and ML models could be beneficial, as previous research indicates. The supervised SVM, Neural Network, and tree-based models are quite common, while we find Hierarchical clustering to be the preferred choice for explaining relationships between entities, such as trading patterns.

3.2.3 Green cluster: textual analysis

This cluster consists of 115 publications credited with 2,695 citations. Our analysis reveals two prominent ML models: “Neural Network” (25 occurrences) and “Support Vector Machine” (13 occurrences). The most common bi-grams are “Textual Analysis” (43 occurrences), “Social Media” (26 occurrences), “Fraud Detection” (26 occurrences), “Annual Reports” (25 occurrences) and “Data Mining” (25 occurrences). Notable tri-grams include “Corporate Social Responsibility” (14 occurrences), “Natural Language Processing” (11 occurrences), “Accounting Information Systems” (9 occurrences) and “Insurance Fraud Detection” (9 occurrences). As Fig. 6 panel B indicates, the key distinction of this cluster is its focus on experimentation with textual data, enabling the extraction of sentiment and the identification of key corporate insights, particularly in the context of fraud detection.

A common approach for sentiment extraction involves word classification, categorizing words as positive, neutral, negative, or other categories to calculate the sentiment of each sentence and extend it to the entire document. This approach often relies on dictionaries and lexicons, which can help remove subjectivity and reduce the researcher's effort (Loughran and McDonald 2016). However, word lists may not always be readily available, and issues related to homographs can arise (Loughran and McDonald 2016). Furthermore, domain-specific dictionaries may not perform effectively in different contexts (Bochkay et al. 2019). In contrast, ML models can be trained to discover text features and assign unique sentiment weights to individual words via manual classification of the training and verification sets. In our corpus, social media provide the textual data required for stock price prediction using ML techniques (Renault 2017; Chun et al. 2020; Vamossy 2021). These data allow for the assessment of investor sentiment and beliefs regarding specific firms at given times. Forecasting trading trends is also achievable by identifying topics from news articles via an unsupervised ML model known as LDA (Han and Kim 2021a). Commonly identified ML models include Naïve Bayes (Li 2010; Slapnik and Lončarski 2021), SVM (Liu et al. 2021) and Neural Networks (Chun et al. 2020; Saurabh and Dey 2020; Azimi and Agrawal 2021).
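A toy sketch of the dictionary-based baseline, with illustrative stand-in word lists rather than the actual Loughran and McDonald dictionaries, is the following:

```python
# Toy dictionary-based sentiment scoring; the word lists below are
# hypothetical stand-ins, not the Loughran-McDonald lexicons.
POSITIVE = {"gain", "growth", "profit", "improve"}
NEGATIVE = {"loss", "decline", "impairment", "litigation"}

def sentiment(text: str) -> float:
    """Net tone: (positive hits - negative hits) / total words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

print(sentiment("Revenue growth and profit improve despite litigation risk"))
```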

Our second subcluster is focused on the detection of misreporting in financial statements, which has repercussions for both investors and employees and contributes to uncertainty in financial markets. Challenges in this group include the ratio of non-fraud to fraud firms, which affects ML classification, and finding the right attributes, which are often noisy because fraudulent statements are deliberately masked to resemble those of non-fraud firms as closely as possible (Perols 2011). Addressing class imbalance often involves under-sampling, aiming for a 1:4 ratio between fraud and non-fraud instances. However, it is crucial to evaluate additional metrics, such as the F-measure or the area under the curve (AUC), to comprehensively assess model performance (Papík and Papíková 2022). Different sets of variables are proposed as fraud detection indicators, including raw financial data instead of ratios (Bao et al. 2020), combinations of ratios, raw data, and dummy variables (Perols 2011), the inclusion of non-accounting variables such as governance, capital markets, and auditing (Bertomeu et al. 2020), and even textual analysis (Chen et al. 2017; Brown et al. 2020; Zhang et al. 2022). Regarding the latter category, an NLP method called Term Frequency-Inverse Document Frequency (TF-IDF) can identify the most important fraudulent accounting narratives in annual reports, enabling Queen Genetic Algorithm and SVM models to classify fraud and non-fraud financial statements (Chen et al. 2017). While the “Bag of Words” technique assesses word relevance based solely on the frequency of words within a document, TF-IDF considers not only the frequency of words in a document but also how frequently those words appear across the entire collection of documents. A more sophisticated method known as “word embedding” takes into account the sentence structure and creates multidimensional vector representations for each word, allowing for similar representations of synonymous words. SVM combined with word embeddings achieved an accuracy of 77% in fraud detection for Chinese firms (Zhang et al. 2022).
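A hedged sketch of such a TF-IDF-plus-SVM classifier is shown below; the labeled narratives are fabricated placeholders, whereas real studies train on thousands of filings and handle class imbalance explicitly.

```python
# Sketch: TF-IDF features feeding a linear SVM for fraud classification.
# The texts and labels are fabricated placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["revenue recognized before delivery of goods",
         "stable cash flows and conservative provisioning",
         "undisclosed related party transactions inflated sales",
         "routine audit confirmed inventory balances"]
labels = [1, 0, 1, 0]  # 1 = fraud narrative, 0 = non-fraud

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["sales inflated through related party transactions"]))
```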

In summary, textual analysis can be conducted through a range of techniques, from simple word counting to vector representation of words. ML can complement both approaches for solving regression, classification, and clustering problems, given their capability to handle multidimensional data. Among unsupervised models, LDA is widely used for topic extraction. In our corpus, neural networks and SVM supervised models are frequently employed and have shown promising results.

3.2.4 Blue cluster: options and limit order trading

The third cluster consists of 89 publications with a total of 904 citations. Our textual analysis indicates keywords related to both ML and traditional models, namely “Deep Learning” (69 occurrences), “Neural Networks” (37 occurrences), “Monte Carlo” (20 occurrences) and “Long Short-term Memory” (11 occurrences). The most common topic-related n-grams are “Option Pricing” (24 occurrences), “Limit Order Book” (20 occurrences) and “High Dimensional” (20 occurrences), while Fig. 6 panel C illustrates the most frequently repeated keywords in this cluster, including "Price" and "Option".

European, American, and Bermudan option pricing and hedging are scrutinized due to the limitations of commonly used traditional techniques. For European options, the Black and Scholes (1973) and Heston (1993) models are among the most frequently applied methods for identifying underlying prices. However, both models are parametric and rely on certain assumptions that can impact accuracy. For instance, the Black–Scholes model assumes constant volatility, leading to inaccurate prices (Funahashi 2020; Nian et al. 2021). The slow Monte Carlo method, another common choice for option pricing, increases accuracy at the expense of speed (Horvath et al. 2021), which can be problematic in high-volatility markets. In contrast, ML models, specifically Neural Networks in our corpus, are trained directly on market data, avoiding the misspecification issues from which parametric models can suffer (Nian et al. 2021). Modeling American options is even more complex, as they can be exercised at any time during the contract's life. The regression-based Monte Carlo approach is gaining popularity; however, the objectivity of variable selection in regression-based methods can be compromised, particularly in the case of high-dimensional options (Hu and Zastawniak 2020). Research suggests either combining Monte Carlo with ML or replacing it. For example, in Goudenège et al. (2020), the combination of tree methods and Monte Carlo for pricing American options decreases computation time without sacrificing accuracy. On the contrary, De Spiegeleer et al. (2018) advocate a new ML model that achieves faster execution speeds than Monte Carlo, even if it sacrifices some accuracy within acceptable thresholds. In our corpus, Neural Networks are the most commonly applied ML family, including CNN, Long Short-term Memory (LSTM) and recurrent neural networks (RNN) (Jang and Lee 2019; Wei et al. 2020; Zhang and Huang 2021).
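For reference, the closed-form Black-Scholes call price, whose constant-volatility assumption is precisely what the ML alternatives aim to relax, can be implemented in a few lines:

```python
# Standard Black-Scholes European call price (constant volatility sigma);
# a reference implementation of the classical benchmark, not of any
# particular reviewed study.
from math import log, sqrt, exp
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """Call price for spot S, strike K, maturity T (years), rate r."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

# Spot 100, strike 105, one year, 2% rate, 20% volatility:
print(round(bs_call(100, 105, 1.0, 0.02, 0.20), 4))
```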

In the context of limit order trading, the evaluation of immediate stock price movements relies on the ask and bid order flows in high-frequency trading environments. One common approximation involves statistical modeling; however, these methods often require assumptions and intensive computations, making them less suitable for existing trading environments (Bouchaud et al. 2002). Modeling in this context is further complicated by the high dimensionality of a limit order book, which includes multiple price levels (Sirignano 2018). At the same time, the information provided by the order book is often overlooked by human observers due to its short-lived nature (Kercheval and Zhang 2015). A deep learning solution has the capacity to generalize and identify relationships between order flow and market prices in a non-parametric manner that can be applied to different stocks (Sirignano and Cont 2019). Neural Networks, for instance, can determine whether price changes result from successful or cancelled orders by considering the order series (Tashiro et al. 2019). However, to train such models effectively, it is necessary to include multiple levels of limit orders, accounting for factors such as order size (number of stocks), price, and precise placement time. In high-frequency trading markets, data is abundant, which can make model training a time-consuming process. To address this, training is often carried out on GPUs rather than CPUs, as graphics cards can efficiently parallelize the training process across thousands of units (Sirignano and Cont 2019). Beyond the Neural Networks that dominate this group, Random Forests have been used to identify the market impact of high order cancellation rates (McInish et al. 2019) and SVMs to predict mid-price movements (Kercheval and Zhang 2015).

In summary, ML is a good alternative for option pricing and limit order trading for two main reasons. Firstly, the current traditional models rely on restrictive assumptions, in contrast to ML models, which learn directly from data. Secondly, due to the high-dimensional nature of these applications, traditional models can be slow to produce a reliable outcome in a highly volatile environment. Lastly, Neural Networks are recognized both for their demand for, and their ability to handle, substantial amounts of data; this characteristic could explain both researchers' preference for them and their success in this cluster.

3.2.5 Yellow cluster: risk management

This cluster comprises 76 articles credited with 632 citations, with the most prolific publication (Butaru et al. 2016) cited 81 times. Our analysis unveils two ML models, “Random Forest” (19 occurrences) and “Neural Networks” (14 occurrences). The most common topic-related bi-grams are “Credit Risk” (16 occurrences) and “Risk Management” (15 occurrences), while the tri-grams “Telematics Car Driving” (10 occurrences) and “Loss Given Default” (9 occurrences) complete the focus of this cluster.

Academic research offers a plethora of indicators for achieving financial distress prediction through early warning systems. These indicators encompass consumer credit risk (Butaru et al. 2016; Zanin 2020), volatility (Laborda and Olmo 2021), and transaction and connectivity network construction (Akbari et al. 2021; Laborda and Olmo 2021), in combination with traditional financial indicators. Robust modeling requires a substantial amount of data, while a crisis is a rare event, thus creating data imbalance. To address this, the creation of synthetic data (Zanin 2020) and the application of the Synthetic Minority Oversampling Technique (SMOTE) (Lee 2020) are proposed to mitigate imbalances during the training phase of supervised ML models. In the context of short time horizons for credit card delinquency prediction, empirical comparisons have favored Random Forests and decision trees over logistic regression when evaluating and selecting models (Butaru et al. 2016). This preference for ML models over regression models is attributed to their superior accuracy, driven by their ability to handle non-linear data (Colak et al. 2020; Amini et al. 2021).
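As an illustration of the SMOTE step, the sketch below oversamples a simulated rare-event dataset using the imbalanced-learn package:

```python
# SMOTE oversampling sketch on a simulated rare-event (e.g., crisis)
# dataset; the data are synthetic, generated purely for illustration.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print("before:", Counter(y))      # heavy majority of non-events

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # synthetic minority samples added
```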

Risk management is of utmost importance for the insurance and lending industries, and publications on this topic are included in this cluster. In the insurance industry, the Random Forest model has demonstrated the capability to replace human predictions of future claim payments, yielding superior estimates (Ding et al. 2020). While the LASSO model can be used for the same task, it should be avoided when working with small datasets, as this can impact prediction accuracy (Devriendt et al. 2021). In addition, Neural Networks can process complex telematics car driving data to measure risk scores and identify driving styles and patterns, offering viable tools for insurance firms (Gao et al. 2022; Meng et al. 2022). Research on discrimination in the lending industry suggests that Random Forest is either able to capture structural relationships or to uncover the “identities” of minorities, leading to a lower acceptance ratio for mortgage loans (Fuster et al. 2021). Loan officers may consider both 'hard' and 'soft' data, yet the delinquency rate on loans approved by gradient boosting models is 33% lower compared to the decisions made by human experts.

Our literature review reveals an additional subcluster focused on the implementation of ML within the real estate domain. ML models are compared to linear regression models as alternatives for tasks such as house pricing and rent estimation (Deppner and Cajias 2022), commercial pricing (Calainho et al. 2022), and renovation premiums (Mamre and Sommervoll 2022). The conclusion from those publications is that models like Random Forest, XGBoost, and Bagging outperform linear regression models. Calainho et al. (2022) attribute this performance to the combination of processing non-linear data and the non-parametric nature of those models.

In conclusion for this cluster, the linear nature of simple regression models emerges as a significant factor driving the adoption of tree-based ML algorithms. Simultaneously, the structure of the data appears to play a crucial role in model selection. Additionally, when dealing with unbalanced data, various techniques should be applied in classification problems.

3.2.6 Purple cluster: bankruptcy prediction, credit risk

This cluster encompasses 70 articles cited 1,094 times, with the most prolific publication (Khandani et al. 2010) being cited 266 times. Our analysis reveals the prominence of three ML models: “Neural Networks” (35 occurrences), “Support Vector Machine” (27 occurrences) and “Random Forest” (23 occurrences). The most common topic-related bi-grams are “Bankruptcy Prediction” (32 occurrences), “Financial Ratios” (31 occurrences), “Credit Risk” (27 occurrences), “Banking Crisis” (25 occurrences) and “Early Warning” (25 occurrences).

In bankruptcy prediction, the focus is on identifying the correct set of attributes able to forecast imminent delinquency, but a variety of methods and financial features are selected across publications. In Mselmi et al. (2017), several ML models are compared against the statistical logit model to predict financial distress in 212 French firms. The combination of SVM with partial least squares achieves a forecast accuracy of 94.28% for a two-year horizon, compared to 92.86% for the standalone SVM model. SVM also outperforms the logit model in default prediction for German firms when supplied with eight predictors (Chen et al. 2011). Lahmiri and Bekiros (2019) explore the use of qualitative, rather than quantitative, data and highlight that statistical model assumptions, such as multivariate normality, are often violated.

In the credit risk subcluster, consumer behavior is treated as an important factor for predicting financial distress. ML models are capable of forecasting delinquency 3 to 12 months in advance, even when a small training population is provided (Khandani et al. 2010). Moreover, the ElasticNet regression model can enhance the performance of ML models by identifying the most significant features for credit score classification (Xu et al. 2019). SVM is a common choice for credit analysis (Yu et al. 2020; Ala’raj et al. 2018); however, in the case of small datasets, logistic regression may be a better alternative (Ala’raj et al. 2018).

In the banking sector, rating agencies frequently misclassify banks, highlighting the need for improved predictions (Viswanathan et al. 2020). Traditional econometric tools for banking crises assume that individual factors can explain their occurrence. In contrast, Duttagupta and Cashin (2011), through a binary classification tree, propose that a combination of factors must occur. Le and Viviani (2018) achieve small accuracy improvements using ML over traditional statistical models by measuring 31 different ratios from banking financial statements. On the contrary, Beutel et al. (2019) promote logistic regression over ML models, as the latter underperform the former on out-of-sample data. This observation, however, may be subject to overfitting, as a small dataset was applied. Lastly, Viswanathan et al. (2020) classify 44 Indian banks through unsupervised K-means based on their credit risk using financial statements and ratios. They employ the LDA algorithm for topic extraction to explain the clusters, and turn to supervised models such as Classification and Regression Trees (CART) and Random Forest for predicting credit ratings, comparing the results to rating agencies. In light of these varying results, it is clear that consensus on the most important features for predicting banking crises has yet to be reached, necessitating further research.

Most publications within this group refer to classification problems; therefore, supervised models are chosen. SVM, Neural Networks, and tree-based models are commonly found; however, there is an indication that for small datasets, simpler statistical models may be more suitable.

3.2.7 Cyan cluster: asset pricing

This cluster consists of 43 publications credited with 878 citations. Word analysis indicates four prevailing bi-grams: “Asset Pricing” (33 occurrences), “Cross Section” (25 occurrences), “Stock Returns” (22 occurrences) and “Stock Market” (19 occurrences). As Fig. 6 panel F illustrates, this group relates to asset pricing, the Holy Grail for investors and financial institutions, given that an accurate estimation of the fair price allows for investment opportunities while minimizing risk. Many models have been proposed since the Capital Asset Pricing Model (CAPM) was first introduced, as previous research revealed empirical failures (Karolyi and Van Nieuwerburgh 2020) and a growing number of anomalies incorporated into newer models (Geertsema and Lu 2020). The three-factor model (Fama and French 1993) was succeeded by, among others, the four-factor model (Hou et al. 2015) and the five-factor model (Fama and French 2015). However, those models fail to encapsulate the full spectrum of cases, as the problem is high-dimensional and, therefore, a larger number of characteristics is needed (Kozak et al. 2020). Although hundreds of estimators have been proposed for both cross-section and time-series problems, often highly correlated and investigated under a linear prism (Weigand 2019; Gu et al. 2020), regression models require the incorporation of a priori knowledge of multiple predictors.
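For reference, the three-factor specification that these later models extend explains asset i's excess return by market, size (SMB), and value (HML) factors:

```latex
% Fama-French (1993) three-factor model
\begin{equation}
  R_{it} - R_{ft} = \alpha_i + \beta_i\,(R_{Mt} - R_{ft})
    + s_i\,\mathrm{SMB}_t + h_i\,\mathrm{HML}_t + \varepsilon_{it}
\end{equation}
```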

One proposed solution is to combine a plethora of predictors, thereby generating new variables that effectively mitigate dimensionality and correlation issues (Fang et al. 2020). Statistical tools have been used for factor selection but not for the construction of new factors. Fang et al. (2020) propose an ML approach combining multiple Neural Networks with a “prior knowledge” feature to create and select the best features for the prediction model. Azevedo and Hoegner (2022) demonstrate the significance of non-linearity and high dimensionality: they achieved nearly 2% monthly returns using a Gradient Boosting model, outperforming both linear and traditional models such as the CAPM, the four-factor (Hou et al. 2015), the three-factor (Fama and French 1993) and the five-factor (Fama and French 2015) models. Non-linearities also seem to be an important factor in predicting abnormal bond returns; thus, ML may be useful in predicting asset price movements (Bianchi et al. 2020). Embracing market anomalies as possible predictors of excess returns has also been studied with promising results (Kozak et al. 2020; Dong et al. 2021). Geertsema and Lu (2020), through unsupervised Hierarchical Clustering, clustered anomalies and tested 41 factors to identify the ones that can explain all of them, concluding with nine factors that scored highest on the Sharpe ratio.

In most publications, supervised models are employed for both regression and classification problems, with Neural Networks being the most common choice. The results of the aforementioned studies indicate an outperformance of ML over simple regression and other traditional models, due to their ability to handle high-dimensional, non-linear data (Table 5).

Table 5 Cluster topics

3.2.8 Cluster distribution per year

To gain more insight into the academic focus on ML in A&F, we examine the cluster distribution per year. Early access publications and articles not included in the main clusters are omitted. During the first years of experimentation with ML, we find that the time-series topic was the sole focus of academic research. Bankruptcy prediction and textual analysis have been present since 2006, while interest in risk management was sparked a decade later. In 2021, all clusters increased significantly, while in 2022 we find a slowdown for option and limit order trading (an 11.5% decrease) and only a modest 14.28% increase for asset pricing in comparison to the other groups. Noticeably, over the last 3 years risk management saw a 520% increase, textual analysis 300%, and time-series forecasts 245% (Table 6).

Table 6 Cluster evolution over time

3.2.9 Cluster conclusions

The six clusters identify the plethora of problems to which ML can be applied, either in conjunction with traditional models or independently. The ability of ML models to process high-dimensional, non-linear data is among the top factors promoting them as good candidates in the A&F discipline. Supervised models are the most frequently applied, indicating that regression and classification problem types are the most common. We also find 72 clustering applications, while reinforcement learning is barely employed. In Table 7, the cluster key points are summarized. Lastly, our initial analysis indicated two smaller clusters composed of four and two papers respectively, which we refrain from analyzing due to their size.

Table 7 Summary of cluster challenges and solutions

3.3 (RQ3) What are the future avenues of ML in A&F research?

Foremost, the analysis of the clustering distribution over time provides valuable insights into research trends that help address our third research question. To further enhance our understanding of future avenues, we employ co-word analysis through VOSviewer. This method facilitates the visualization of interconnections between common words and, importantly, their average year of occurrence. For our purpose, following similar studies (e.g., Burton et al. 2020; Rojas-Lamorena et al. 2022), we select author-specified keywords identified at least four times in our corpus, thus focusing on the most important topics. As Fig. 7 depicts, each keyword is assigned a color indicating its average year of occurrence; a darker color palette denotes years prior to 2019, while vibrant colors relate to the last couple of years. Our results reveal the progression of both ML algorithm applications and topics, which we analyze over three periods. Given the number of publications in the early years, we observe that author-specified keywords occur at least four times only since 2017.

Fig. 7 Co-word analysis of author-specified keywords

Starting with the first period, until 2018, the key focus of research can be categorized under market returns and crisis prediction. Topic keywords during this period include algorithmic trading (Allen and Karjalainen 1999; Creamer 2012; Cont and Kukanov 2016), stock returns (Constantinou et al. 2006), exchange rate (Amat et al. 2018), implied volatility (Bekiros and Georgoutsos 2008; Manela and Moreira 2017), early warning systems (Joy et al. 2016; Alessi and Detken 2018), and banking crises (Duttagupta and Cashin 2011; Alessi and Detken 2018). Moreover, we notice a plethora of clustering and classification applications during this timeframe, in which Genetic Programming (Payne and Tresl 2014; Karathanasopoulos et al. 2015), Neural Networks (Fioramanti 2008; Chen et al. 2013) and boosting algorithms (Creamer and Freund 2010; Creamer 2015) were widely adopted.

Between 2019 and 2020, publications related to time-series forecasting (Wiese et al. 2020), asset pricing (Calomiris and Mamaysky 2019; Weigand 2019) and data mining (Anouze and Bou-Hamad 2019) were introduced. Volatility and risk remain prominent themes in the literature, as indicated by keywords such as credit ratings (Abedin et al. 2019; Xu et al. 2019), systemic risks (Arakelian et al. 2019; Dungey et al. 2020), peer-to-peer lending (Jagtiani and Lemieux 2019; Zanin 2020) and financial distress (Gkillas et al. 2020; Samitas et al. 2020). Regarding ML keywords, Neural Networks (Mäkinen et al. 2019; Sun 2019) and bagging algorithms (Pace and Hayunga 2019) are proposed, while reinforcement learning, although limited, emerged (Buehler et al. 2019; Wang and Zhou 2020).

During the last couple of years, research has shifted toward topics driven by technological breakthroughs. Cryptocurrency markets (Anghel 2021; El Montasser et al. 2022), big data (Xue et al. 2021; Obaid and Pukthuanthong 2022), fintech (Han and Kim 2021b), textual analysis (Ongsakul et al. 2021; Aziz et al. 2022) and sentiment extraction (Liu et al. 2021; Obaid and Pukthuanthong 2022) exemplify this trend. Publications related to the Covid-19 pandemic, its impact on financial markets (Guo et al. 2021), and the response of banks (Talbot and Ordonez-Ponce 2022) are also a main consideration of academic research during the last two years. Traditional themes are also explored, such as portfolio management (Mahmoudi et al. 2021; Pun and Wang 2021), realized volatility (Engle et al. 2021; Lu et al. 2022), and option pricing (Chataigner et al. 2021; Bayer et al. 2022). In addition to Neural Networks and deep learning (Jiang et al. 2022), recent research has expanded by incorporating additional ML models such as Random Forests (Laborda and Olmo 2021; Ghosh et al. 2022), SVM (Petridis et al. 2022) and LASSO (Devriendt et al. 2021; Shahzad et al. 2022).

From the co-word analysis, we observe that volatility, risk management, and price forecasting are predominant topics throughout our selected timeframe, suggesting the potential continuation of this trend in the near future. Neural Networks are expected to maintain a central position in research; however, we anticipate the emergence of specialized algorithms tailored to specific tasks, departing from the prevalent broader-purpose algorithms. This observation is based on the increased number of publications in journals such as Quantitative Finance that excel in this field (e.g., Horvath et al. 2021; Kim et al. 2022). Furthermore, our analysis in RQ1 indicates a shift in the application of ML methodologies from the US market towards the Asian and global markets, indicating an evolving landscape for research focus.

As shown above, ML has multiple advantages over traditional methods. However, there is still a debate on how it can be adapted to the current strict regulations in A&F and how it can be implemented given the limited data-analysis literacy of the existing workforce in these fields. We believe that further research will be conducted to enhance the traceability of the algorithms' decision-making, as well as to identify the actions needed to support new or updated regulatory requirements. We argue that practitioners in A&F should be more technologically inclined and able to work alongside advanced automation tools to enhance decision-making capabilities.

4 Challenges and opportunities in advancing applications of ML

As we have seen in the previous sections, ML models have gained significant popularity in recent years. However, researchers should be familiar with a model's shortcomings before employing it in any application, whether in A&F or any other field. In addition, there are common pitfalls that researchers should avoid. In this section we address these issues as well as the most common approaches to overcome or alleviate them. Understanding these constraints is crucial for the informed and effective utilisation of these models in practical scenarios in A&F. In addition, we present various ideas for future research in these areas.

4.1 Can we enhance the transparency of ML approach?

Enhancing transparency in ML models is a crucial concern, especially in fields like A&F, where interpretability is essential for regulatory compliance and risk assessment. While the term “black box” is commonly used in ML, it is worth noting that certain ML algorithms, such as Wavelet Networks (considered “grey boxes”) and Genetic Programming (considered “white boxes”), offer a degree of transparency.

Recently, various methods to improve model interpretability have been proposed, such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and surrogate models. In our literature review, SHAP stands out as the most frequently employed approach. It draws on cooperative game theory to allocate the contribution of each feature to the prediction (Lundberg and Lee 2017). LIME is used for instance-level interpretations and approximates a complex model locally with a simple one (Ribeiro et al. 2016). A global surrogate model is a simple model that approximates a black-box model over the whole input space. Other common approaches dedicated to CNNs are Feature Visualization, which uses neuron activation maximization to visualize learned features, and Network Dissection (Bau et al. 2017).
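As an illustration of the SHAP approach, the following minimal Python sketch computes attributions for a tree-based model; the simulated data, the random-forest model, and the use of the open-source shap package are our own illustrative choices rather than the method of any reviewed paper.

```python
# Illustrative sketch: SHAP attributions for a tree-based model.
# Data, model, and feature meanings are hypothetical.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                 # e.g. four firm-level ratios
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)         # efficient SHAP values for trees
shap_values = explainer.shap_values(X)        # (500, 4): one value per feature

# Global importance ranking: mean absolute SHAP value per feature.
print(np.abs(shap_values).mean(axis=0))
```

Each row of shap_values explains a single prediction, while the mean absolute value per column provides a global importance ranking, which is the view most often reported in applied work.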

Future research could focus on developing novel techniques for model explainability and visualization tailored specifically to A&F applications. This might involve creating domain-specific interpretability tools that financial analysts and auditors can use effectively (Bertomeu 2020). A clear and transparent decision process for financial managers, accountants and auditors will further enhance the adoption and acceptance of ML by practitioners.

4.2 Can we conduct statistical inference and hypothesis tests based on ML?

The majority of the papers that employ ML methods focus on point forecasts, while statistical inference and prediction intervals are overlooked. However, financial managers are also interested in prediction intervals, statistical inference, and hypothesis testing. Traditional statistical methods often rely on assumptions like normality, linearity, and independence that may not hold in ML models (Alexandridis and Zapranis 2014). However, recent research has shown ways to conduct hypothesis tests and statistical inference with ML. Techniques like permutation tests and bootstrap methods can be applied to assess the significance of model features or to compare different models (Efron and Tibshirani 1994; Zapranis and Refenes 1999). To our knowledge, the only approach to a complete statistical framework is presented in Alexandridis and Zapranis (2013) and Alexandridis and Zapranis (2014). The authors provide a complete model identification framework for a class of Neural Networks called Wavelet Networks, but the methodology is applicable to every family of Neural Networks. More precisely, a model selection procedure as well as a statistical variable significance and a statistical variable selection framework are presented. The methodology is based on a Sensitivity Based Pruning criterion and bootstrap techniques. Furthermore, the authors provide a framework for prediction intervals based on the Bagging and Balancing techniques. The drawback of the previous algorithms is that they are based on bootstrapping techniques and are hence time consuming. Future work could focus on refining these statistical techniques to make them more applicable to ML in A&F. Additionally, exploring the theoretical underpinnings of the statistical properties of ML models could provide further insights.
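As a generic illustration of the permutation idea (not the Wavelet-Network framework above), the sketch below measures how much the in-sample error of a hypothetical network deteriorates when each input is permuted; a variable whose permutation barely increases the error is a candidate for removal. The data, model, and number of replications are illustrative assumptions.

```python
# Sketch of a permutation test for variable significance in an ML model.
# Data, model, and replication count are hypothetical.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))                 # only the first two inputs matter
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                     random_state=1).fit(X, y)
base_mse = mean_squared_error(y, model.predict(X))

for j in range(X.shape[1]):
    increases = []
    for _ in range(200):                      # permutation replications
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between x_j and y
        increases.append(mean_squared_error(y, model.predict(Xp)) - base_mse)
    # A large average increase in MSE suggests the variable is significant.
    print(f"feature {j}: mean MSE increase = {np.mean(increases):.4f}")
```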

4.3 How can ML help decision making in social science?

ML can be applied to data analysis and pattern recognition, identifying relationships between variables to gain deeper insights into human behaviour, societal trends, and interactions, thereby informing more robust decision-making (Lv et al. 2020). Similarly, ML can be used in predictive modelling in fields such as economic forecasting, crime prediction, and disease outbreak modelling, guiding policymakers in making informed decisions (Kleinberg et al. 2015). Text and sentiment analysis can be employed to analyse social media posts, surveys, reviews, and other textual data to understand public perceptions, attitudes, and sentiments toward various social issues, policies, products, and services, ultimately shaping decision-making strategies (Shrestha et al. 2021). Personalisation and recommender systems can be used in areas like education (suggesting courses), healthcare (recommending treatment plans), and policy-making (tailoring interventions to specific groups) (Sarker 2021).

ML can also contribute to causal inference, identifying how one variable affects another; to network analysis, identifying influential nodes, community structures, and information-flow patterns, thus aiding the understanding of social dynamics and communication; and to policy analysis and simulation, assessing potential outcomes before policies are implemented, helping to make more informed decisions and reduce unintended consequences (Sarker 2021). In the healthcare and social services sector, ML aids in resource allocation by predicting demand, identifying high-risk populations, and suggesting personalised interventions (Hoffman and Podgurski 2019). Finally, ML could be employed in ethics and bias detection to help identify biases in datasets and models, promoting fairness and ethical considerations in decision-making processes. It can also help in identifying potential discriminatory outcomes of certain policies or interventions (Di Maggio et al. 2022).

Successful application of ML in social science requires careful consideration of ethical and privacy issues, as well as transparency and interpretability, in order to avoid any biases. As shown in Di Maggio et al. (2022), minorities were more likely either to be denied credit or to be granted credit on unfavourable terms, while Hoffman and Podgurski (2019) report similar algorithmic discrimination issues in health care. Collaboration between domain experts and data scientists is crucial to ensure that ML techniques are applied in a responsible and meaningful way in social science research and decision-making.

Future research should aim to improve model accuracy and robustness in social science applications. Furthermore, it should aim to develop ethical and responsible AI practices to address potential biases in data. Finally, it should aim to foster interdisciplinary collaborations between ML experts and social scientists to ensure the relevance and validity of models.

4.4 How can we control the overfitting and over-parameterization issues?

One of the most crucial steps in ML is to identify the correct topology of the model. For example, in Neural Networks a desired architecture should contain as few hidden units (HUs, or neurons) as necessary, while at the same time explaining as much of the variability of the training data as possible. A network with fewer HUs than needed would not be able to learn the underlying function, while selecting more HUs than needed will result in an over-fitted model.

The approaches commonly proposed in the literature to mitigate these issues are early stopping, regularization (including Bayesian, L1, and L2 regularization), brute-force pruning, and the elimination of irrelevant connections. Other preventive practices include feature pruning, i.e., eliminating irrelevant or non-significant dimensions of the selected dataset, embedding additional distinct cases in the training set, and data augmentation, which helps models separate important features from noise. Some ML models include built-in prevention mechanisms, such as Boosting and Bagging ensembles, as well as random node dropout in Neural Networks.
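As a concrete illustration, the sketch below combines two of these safeguards, L2 regularization and early stopping on an internal validation split, using scikit-learn's MLPRegressor; the network size, penalty strength, and simulated data are illustrative assumptions.

```python
# Sketch: mitigating over-fitting with L2 regularization and early stopping.
# Architecture, penalty, and data are illustrative choices.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=300)

model = MLPRegressor(
    hidden_layer_sizes=(20,),   # deliberately small network
    alpha=1e-3,                 # L2 penalty on the weights
    early_stopping=True,        # hold out an internal validation split...
    validation_fraction=0.1,    # ...and stop once its score stops improving
    n_iter_no_change=10,
    max_iter=5000,
    random_state=0,
).fit(X, y)
print(model.n_iter_)            # iterations actually used before stopping
```

Early stopping here monitors a held-out fraction of the training data, so the reported iteration count is typically well below the maximum allowed.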

The previous methods do not search for an optimal model architecture; instead, a very large model is built, and the above techniques are then applied to avoid over-fitting. Smaller networks are usually faster to train and need less computational power to build (Reed 1993). Detection tools take into account the difference in accuracy between the training and the validation sample: training stops when the error on the validation sample starts to increase. Other methods include bootstrapping or k-fold cross-validation techniques to improve the generalisation of the model. Others propose ad hoc rules, such as requiring the observations-to-parameters ratio to exceed a specific number, e.g. five; this is similar to the usual requirement of around 30 observations for linear models. Zapranis and Refenes (1999) and Alexandridis and Zapranis (2013) propose the Minimum Prediction Risk criterion for the optimal selection of neurons in a Neural Network.
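A data-driven alternative to such ad hoc rules is to select the number of hidden units by k-fold cross-validation, as in the sketch below; the candidate sizes and simulated data are hypothetical.

```python
# Sketch: choosing the number of hidden units by k-fold cross-validation.
# Candidate sizes and data are illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=400)

scores = {}
for h in (2, 5, 10, 20, 40):
    model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=5000, random_state=2)
    # 5-fold CV; negative MSE so that larger scores are better.
    scores[h] = cross_val_score(model, X, y, cv=5,
                                scoring="neg_mean_squared_error").mean()

best = max(scores, key=scores.get)
print(f"selected hidden units: {best}")
```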

Future research can explore novel regularization techniques tailored to A&F data, which often have specific characteristics such as time-series dependencies, imbalanced classes, or high dimensionality.

4.5 To what extent is ML sensitive to training data? Is there any robustness?

ML models can be sensitive to the training data, especially in cases of small or unrepresentative datasets. This sensitivity is not confined to any specific model type; it applies universally, whether the models are linear or nonlinear, parametric or nonparametric, whenever the sample used for model training fails to adequately represent the overall population. Moreover, in classification problems the test-set population should ideally be evenly split between groups; otherwise, careful consideration is required to establish the appropriate cut-off point for the classifier in order to ensure robust and accurate classification.
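One simple way to establish such a cut-off, sketched below on simulated imbalanced data, is to scan candidate thresholds on a validation set and keep the one that maximizes a criterion such as balanced accuracy; the data, model, and criterion are illustrative choices.

```python
# Sketch: choosing a classification cut-off on imbalanced data.
# Data and the balanced-accuracy criterion are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(size=2000) > 1.5).astype(int)  # rare positive class

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=3)
clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Scan candidate cut-offs instead of using the default 0.5.
cutoffs = np.linspace(0.05, 0.95, 19)
best = max(cutoffs, key=lambda c: balanced_accuracy_score(y_val, proba >= c))
print(f"selected cut-off: {best:.2f}")
```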

Another common issue in the literature where ML is applied in A&F is the omission of the data pre-processing step. Trends and periodicities should be removed from the data, and appropriate techniques to treat outliers should be applied. Finally, as in the case of statistical models, ML models are also affected by structural breaks in the data.
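A minimal pre-processing sketch along these lines, assuming a simulated price series: first differencing of the log series removes the trend, and winsorization at hypothetical 1%/99% quantiles dampens outliers.

```python
# Sketch: detrending by first differencing and winsorizing outliers.
# The simulated price series and the quantile cut-offs are illustrative.
import numpy as np

rng = np.random.default_rng(4)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, size=1000)))

returns = np.diff(np.log(prices))          # differencing removes the trend

lo, hi = np.quantile(returns, [0.01, 0.99])
returns_w = np.clip(returns, lo, hi)       # winsorize extreme observations
print(returns_w.mean(), returns_w.std())
```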

Researchers can assess robustness by performing sensitivity analyses, using techniques like dropout, and applying adversarial testing. Another approach is to create bootstrapped versions of the training sample, train a different model on each sample, and then use an amalgamation of the predictions. Finally, adaptive models have been applied to update the architecture of ML models online, accounting for structural breaks or jumps in the data or any other change in the data-generating process (e.g., Cao and Tay 2003; Lin et al. 2006).
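The bootstrap approach can be sketched as follows; the base learner, the number of replications, and the quantile band are illustrative assumptions, and the band reflects only sensitivity to the training sample rather than a full prediction interval.

```python
# Sketch: bootstrap-aggregated predictions to assess training-data sensitivity.
# Base model, replication count, and data are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=300)
X_new = rng.uniform(-1, 1, size=(10, 2))

preds = []
for _ in range(100):                               # bootstrap replications
    idx = rng.integers(0, len(X), size=len(X))     # resample with replacement
    model = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
    preds.append(model.predict(X_new))

preds = np.array(preds)
point = preds.mean(axis=0)                         # amalgamated prediction
band = np.quantile(preds, [0.05, 0.95], axis=0)    # spread across resamples
print(point[:3], band[:, :3])
```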

Achieving complete robustness in all situations is often a challenging and ongoing process. Addressing issues such as data quality, bias, and data drift can help make ML models more resilient to variations in input data. Future work could involve developing techniques for training models that are inherently more robust to variations in training data distribution and exploring methods for model robustness evaluation.

4.6 How do researchers fine-tune hyperparameters in A&F data?

Fine-tuning the hyperparameters of ML models is often done using techniques like grid search, random search, or Bayesian optimization. Usually, these approaches are coupled with resampling techniques like the bootstrap or cross-validation. However, domain-specific knowledge is crucial in determining appropriate parameter settings.
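For instance, a random search coupled with cross-validation can be sketched as follows; the estimator, search space, and simulated data are hypothetical.

```python
# Sketch: random-search hyperparameter tuning with cross-validation.
# The estimator, search space, and data are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=6),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "learning_rate": [0.01, 0.05, 0.1],
        "max_depth": [2, 3, 4],
    },
    n_iter=10,                 # sample 10 of the 27 combinations
    cv=5,                      # 5-fold cross-validation per candidate
    scoring="neg_mean_squared_error",
    random_state=6,
).fit(X, y)
print(search.best_params_)
```

Random search scales better than an exhaustive grid when the search space grows, which is why it is often preferred for the larger models used in A&F applications.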

Future research could focus on automated hyperparameter tuning methods tailored to A&F datasets. Additionally, developing domain-specific guidelines for hyperparameter tuning can help researchers and practitioners navigate this process effectively.

5 Conclusions

ML offers several advantages over traditional methods currently employed in A&F, including the extraction of features, pattern recognition, processing of high-dimensional data and handling nonlinearity. Our paper sheds light on the current state of research and applications in ML while also suggesting new paths for further investigation. Through both literature review and bibliographic coupling, we explored three research questions.

In our first research question, we identified a surge in this research field since 2015 that continues to date. ML applications in finance constitute 88.38% of our corpus, while 16.35% of the papers are published in the Quantitative Finance journal. Among the 67 countries associated with our research, Austria and the United States of America emerged as the most cited, in contrast to Japan and Thailand, which rank at the bottom of the list.

In the second research question, we constructed six clusters through bibliographic coupling and analyzed them using a Bag-of-Words technique and a literature review. This led us to conclusions about current challenges, the key ML algorithms proposed, and the evolution that this new technology is bringing about. The strongest assets of the new models lie in their ability to handle multi-dimensionality, non-linearity, and multiple sources of information such as images and text. Neural Networks, SVM, and tree-based algorithms proved effective in a plethora of applications as long as enough training data are at hand; otherwise, out-of-sample predictions may exhibit lower accuracy than traditional models. The most commonly employed models are supervised, while unsupervised models are predominantly used for clustering and topic extraction using the LDA algorithm. Notably, in the past three years, there has been extensive exploration of topics associated with risk management, textual analysis, and time-series forecasting.

To address our third research question, we conducted a co-word analysis on author keywords, revealing the exploration of volatility, risk management, and price forecasting since the early stages of academic research in ML within the A&F discipline. Furthermore, we sought to identify the future direction of research in the fields of ML and A&F. Our results indicate that the trend in the above topics will continue, while we expect the development of more advanced methodologies that are also more tailored to specific applications in the area of A&F. Finally, we examined in detail the limitations of ML algorithms, presenting the most common approaches proposed in the literature to alleviate these issues, where possible, as well as directions for future research in these areas.