1 Introduction

Gold has been an important precious metal for centuries. It is a major financial asset for countries and a key component of global monetary reserves for trading and currency hedging (Capie et al., 2005; Wen et al., 2017). It also plays a prominent role in investments, especially as a hedge against adverse financial events. Indeed, in times of financial turmoil, when leading stock indices decline, the prices of precious metals tend to move in the opposite direction. Forecasting gold price fluctuations is therefore a crucial issue for investors, for mining projects and related companies, and in general for any agent who sees gold as an indicator of the future performance of the world economy.

Up to now, several studies have been performed to predict commodity prices such as the gold price. There are three main categories of prediction methods: (1) classical methods, (2) artificial intelligence methods, and (3) hybrid approaches. First, traditional mathematical models such as the Autoregressive Integrated Moving Average (Parisi et al., 2008), jump and dip diffusion (Shafiee & Topal, 2010a, 2010b) and multi-linear regression (Escribano & Granger, 1998; Kearney & Lombra, 2009) have been used for gold price forecasting. These classical methods mostly describe the linear relationship between variables through a specific ex ante analytical formulation. Second, with the recognition of the nonlinear and complex characteristics of the gold price system (Alameer et al., 2019), intelligent models, especially artificial neural networks, have proven useful for predicting volatile financial variables, which are quite difficult to capture with classical statistics and econometrics. They are one of the most important types of machine learning models and have been introduced and examined for forecasting commodity prices (Khashei & Bijari, 2010; Lineesh et al., 2010; Parisi et al., 2008). The characteristics of artificial intelligence methods that make them appropriate for prediction are their nonlinear structure, flexibility and data-driven learning process. Third, hybrid approaches combine artificial neural networks with several other models to forecast the fluctuations of variables such as commodity prices, market returns, etc. For example, Kristjanpoller and Hernandez (2017) proposed a hybrid ANN-GARCH model to predict the volatility of the spot prices of gold, silver, and copper. To forecast long-term gold price fluctuations, Alameer et al. (2019) used a whale optimization algorithm to train a multilayer perceptron neural network.

The main novelty of this study is twofold. First, it assesses the performance of machine learning models and shows the success of XGBoost in forecasting the gold price. Second, it is the first, to the best of our knowledge, to analyze the importance of the individual drivers of gold price fluctuations using SHAP (SHapley Additive exPlanations). The gold price is nonlinear, time-varying, and driven by many factors; consequently, it is especially important to detect the most influential factors first and then to combine them in order to improve prediction accuracy. Other studies accomplish this task using hybrid models: one method detects the factors and a machine learning model combines them. For instance, Chen and Zhang (2019) use a Projection Pursuit (PP) algorithm for the factors and a neural network for the prediction.

This paper makes two major contributions to the literature. The first is to begin the forecast with six machine learning models and to compare their performance in gold price prediction. These models are linear regression, neural networks, random forest, and three gradient boosting methods based on decision trees: the Light Gradient Boosting Machine (LightGBM), the CatBoost algorithm and eXtreme Gradient Boosting (XGBoost). To the best of our knowledge, this research represents the first attempt to use the CatBoost, LightGBM and XGBoost models for forecasting gold price fluctuations. The best-fitting model is identified according to performance criteria including the coefficient of determination (R2), the mean absolute error (MAE), and the root mean square error (RMSE). We show that XGBoost outperforms the other machine learning techniques in predicting the gold price, providing evidence of the benefits of artificial intelligence for improving forecasting and, more specifically, of XGBoost as a successful forecasting procedure. Second, in terms of model interpretation, which is especially important when using machine learning models that are often difficult to interpret, several studies have started to take advantage of SHAP (Ribeiro et al., 2016; Štrumbelj & Kononenko, 2014). To the best of our knowledge, however, SHAP interaction values have not yet been applied to a financial data set. Several studies have shown that gold prices may be affected by many predictors, such as inflation (Alameer et al., 2019; Beckmann & Czudaj, 2013), currencies (Beckmann & Czudaj, 2013; Kristjanpoller & Minutolo, 2016), metals (Bhatia et al., 2018; Schweikert, 2018), crude oil (Elie et al., 2019; Kanjilal & Ghosh, 2017; Sephton & Mann, 2018) and exchange rates (Akbar et al., 2019; Singhal et al., 2019). The SHAP method assigns each factor an importance value for gold price prediction. Applying explanation methods to a complex model in order to interpret gold price forecasts is of great interest, as it allows one to understand how the model behaves.

The rest of this paper is structured as follows. Section 2 provides a literature review of time series forecasting models and the factors that influence gold prices. Section 3 describes the data. Section 4 presents the methodology: it summarizes the six machine learning models used to forecast the gold price and the method used to interpret the predictions generated by these complex models. The results obtained are discussed in Sect. 5. Finally, Sect. 6 concludes.

2 Related literature

In order to improve the quality of gold price predictions, our study examines two important steps of a prediction process: first, the choice of the most relevant input variables and second, the choice of the best statistical model. Even if no consensus has emerged on which macroeconomic variables should be taken into account as primary drivers of gold-price fluctuations, empirical studies often use a common set of variables such as exchange rates, precious metals and mineral commodities prices, oil prices, and inflation. In their survey, O’Connor et al. (2015) show that gold prices have a broad range of predictors, covering the fields of commodities, financial variables, macroeconomic data, and interest rates.

The literature has demonstrated that exchange rates have strong predictive power for forecasting commodity prices (Chen et al., 2010) and especially gold-price fluctuations (Pukthuanthong & Roll, 2011; Reboredo, 2013). Bodart et al. (2015) provided evidence of the relationship between exchange rates and commodity prices for developing countries that export these commodities. Ciner (2017) confirmed that the exchange rate of the South African rand has significant predictive power for forecasting palladium and platinum prices, and to a lesser degree silver prices. Sari et al. (2010) argued that precious metals respond to any shock in the exchange rate or in the prices of other precious metals.

Indeed, the relationship between gold and other precious metals is complex. Studies report a co-integration relationship between gold and silver, and a role for financial crises, but empirical studies have not reached a consensus concerning the direction of causality. Bhatia et al. (2018), in contrast to Sensoy (2013), show the existence of two-way causality among precious metals. Studies have also found cointegrating relationships between the prices of mineral commodities (Kucher & McCoskey, 2017; Liu et al., 2019; Roberts, 2009; Rossen, 2015; Wu & Hu, 2016; Yue et al., 2015). Batten et al. (2015) report evidence of time-varying spillover effects between precious metal prices, which can be interpreted as evidence of time-varying market integration.

Several studies have confirmed that, because oil is still one of the most widely used sources of energy, crude oil prices are the leading cause of commodity price volatility (Abd Elaziz et al., 2019; He et al., 2010; Lardic & Mignon, 2008; Shafiee & Topal, 2010a, 2010b). Behmiri and Manera (2015) examined the influence of oil price shocks on metal (including gold and silver) price volatility and found that the price volatility of these metals has been influenced by oil price shocks. The oil price is considered one of the main macroeconomic variables (together with inflation) that influence the gold price (Batten et al., 2010; Tully & Lucey, 2007). It is generally recognized that there is a positive correlation between gold and crude oil prices (Teetranont et al., 2018). That is also the case in a recent study by Mo et al. (2018), which explored the dynamic linkages between the USD and the gold and crude oil markets. Bedoui et al. (2019) also found that gold, oil and USD exchange rates have strong connections during periods of crisis. Furthermore, Cologni and Manera (2008) observed that rising oil prices increase metal prices through inflation effects.

Inflation is another major macroeconomic variable that influences gold prices (Batten et al., 2010; Fortune, 1987; Mahdavi & Zhou, 1997). Following Shafiee and Topal (2010a, 2010b), the two most important variables that explain gold price behavior are the oil price and the inflation rate; they report a negative correlation between the gold price and the inflation rate. In a recent study, Alameer et al. (2019) indicate that all the previously presented variables (crude oil, iron, silver, and copper prices, and the exchange rates and inflation rates of the US and China) have high forecasting power for gold prices. Finally, it is also well known that precious metals play a prominent role as a hedge against adverse financial events, the so-called "safe haven" hypothesis in the literature (Baur & Lucey, 2010; Baur & McDermott, 2010). Kang et al. (2017) demonstrate that gold and silver could apparently also benefit from a flight-to-quality phenomenon during financial crises. This is why authors have introduced market indices to explain gold price fluctuations (Liu et al., 2017; Pierdzioch et al., 2016; Kristjanpoller & Minutolo, 2015). Akbar et al. (2019) in Pakistan and Singhal et al. (2019) in Mexico demonstrate the same effect by studying the dynamic relationships among the gold price, the stock price index and the exchange rate. Very recent studies (Risse, 2019; Zhang & Ci, 2020) combine the US CPI, the federal funds rate, crude oil futures prices, the nominal effective exchange rate and the Dow Jones index as inputs for gold price forecasting.

Numerous statistical approaches have already been used to predict gold price fluctuations. Artificial neural networks (NN) are one of the most important types of machine learning methods examined for forecasting commodity prices (Khashei & Bijari, 2010; Lineesh et al., 2010; Parisi et al., 2008). Recent studies have combined artificial neural networks with other machine learning approaches in order to improve prediction efficiency. Ramyar and Kianfar (2017) demonstrated the superiority of the MLP neural network over the vector autoregressive model (VAR) for forecasting crude oil prices. Alameer et al. (2019) compared a recent meta-heuristic method, the whale optimization algorithm (WOA), used to train a multilayer perceptron neural network, to other models including the classic NN, particle swarm optimization for NN (PSO–NN), the genetic algorithm for NN (GA–NN), and grey wolf optimization for NN (GWO–NN). Deep learning is another approach to improve the predictive ability of traditional ANN. Deep learning algorithms have three main advantages: they improve the speed of network training, they avoid being trapped in local minima, and they solve the multi-layer network training problems (Sezer et al., 2020). Recently, these methods have been used in time series forecasting.

Zheng et al. (2019) proposed an improved Deep Belief Network (DBN) for forecasting exchange rates and determined that it worked better than traditional methods. Zhang and Ci (2020) show the superiority for gold price forecasting of a DBN model (compared to ARIMA or a classical NN) composed of restricted Boltzmann machines (RBM) for pre-training and a layer of supervised back-propagation (BP) for fine-tuning. Chen et al. (2020) went a step further in this direction by using an extreme learning machine (ELM) for time series forecasting, and specifically for the gold price. Compared with most other machine learning algorithms, such as support vector machines and deep learning methods, ELM boasts a faster learning speed.

Besides linear regression and NN models, we have chosen to implement tree boosting methods. Hybrid models are able to overcome the limitations of the individual models and increase the forecasting accuracy by combining the advantages of both linear and nonlinear models (Khashei & Bijari, 2011). Hybrid models have already been used to predict the gold price: Wen et al. (2017) used ensemble empirical mode decomposition (EEMD), SVM and ANN to analyze and predict the gold price series, Herawati et al. (2017) utilized a traditional recurrent neural network (RNN) and EEMD, and Zhu and Zhang (2018) developed a hybrid model using an ANN, principal component analysis (PCA) and a genetic algorithm (GA). Hybridization of time series analysis methods and machine learning has also been used to predict gold prices. Several papers (see Kumar, 2018 for a recent example and the references therein) propose combining ARIMA (for the linear part of the time series) with ANN (for the nonlinear component); more original is the combination of the discrete wavelet transform with support vector machines by Risse (2019). Very recently, Du et al. (2020) combined ELM and hybrid approaches to analyze the traits of metal prices. On the whole, these scholars have confirmed the higher prediction performance of hybrid models over that of the individual methods.

The main idea of tree boosting methods is to combine decision tree methods, represented in our paper by random forests, with gradient boosting. Interest in random forests remains strong: in a very recent paper, Pierdzioch and Risse (2020) use multivariate random forests to forecast a vector of returns of four precious metal prices (gold, silver, platinum, and palladium) and show that multivariate forecasts are more accurate than univariate forecasts. The gradient represents the slope of the loss function, so if the gradients are large at some points, these points are important for finding the optimal split point. Recent research (Pierdzioch et al., 2015a, b) has already used the boosting approach to study the determinants of the returns of the price of gold. The combination of decision trees and gradient boosting has the advantages of a good training effect and not easily over-fitting. Specifically, in this paper we compare XGBoost with two very recent boosting methods, LightGBM and CatBoost. These algorithms use different splitting methods in order to increase learning speed, prevent overfitting and improve performance. They have already been used to analyze financial data sets (Basak et al., 2019; Huck, 2019; Ma et al., 2018) but never to predict commodity price fluctuations.

In practical financial decision situations, decision makers not only need to make accurate predictions, they also have to justify how predictions are obtained and why a given decision is taken (for instance, to refuse a loan based on the financial status of the borrower, or to buy/sell a financial asset, in our example gold). In the academic world, it is also important to understand the causal relationships between variables and the hierarchy of causes. To answer these practical and theoretical concerns, we use SHAP to explain the output of the machine learning model. The idea of SHAP is to show the contribution of each feature in moving the model output from the base value of the explanatory variables to the final predicted value. In short, SHAP values represent a feature's responsibility for a change in the model output, and features pushing the prediction higher are distinguished from those pushing it lower. To our knowledge, SHAP interaction values have not yet been applied to analyze a financial data set.

3 Data and variables

In this paper, we investigate the effect of several explanatory variables on the gold price, which is quoted in US dollars. The data cover the period from January 1986 to December 2019, comprising 408 monthly observations. The sample has been divided into training (80%) and test (20%) sets in order to compare the performance of the different machine learning models: we randomly partition the dataset by selecting 80% of the data as the training set and the remaining 20% as the testing set. This method is commonly used in previous research (e.g. Abellán & Mantas, 2014; Antunes et al., 2017; Ben Jabeur et al., 2020). Also, Gholamy et al. (2018) show that the best results are obtained when 20–30% of the data is used for testing and the remaining 70–80% for training. Table 1 gives more information about the data and variables used in this study. Table 2 provides descriptive statistics of the time series and Table 3 presents the correlation matrix between variables. It is clear that the gold price is significantly correlated with all the predictor variables.
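
For illustration, the partition can be reproduced along the following lines; this is a minimal sketch in which the file name and column labels are hypothetical placeholders for the variables listed in Table 1:

```python
# A minimal sketch of the 80/20 random partition described above; the file name
# and column names are hypothetical placeholders for the variables in Table 1.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("gold_monthly.csv")  # 408 monthly observations, Jan 1986 to Dec 2019
features = ["Silver", "Crude_Oil", "Iron_Ore", "USD_EUR", "USD_CNY", "SP500", "CPI"]
X, y = df[features], df["Gold"]

# Randomly select 80% of the observations for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
```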

Table 1 Data and variables
Table 2 Descriptive statistics
Table 3 Correlation matrix

4 Methodology

In this section, we present six machine learning models to forecast the gold price. We describe the metrics that can be used to evaluate their performance. We also present the SHAP approach to interpret the results provided by machine learning models.

4.1 Machine learning models

4.1.1 Linear regression

Linear regression is a statistical method that analyzes the effect of selected independent variables on a dependent variable. It uses the ordinary least squares method to estimate the linear relationship between variables. The forecasting model is expressed as follows:

$$ Y_{t} = \beta _{0} + \beta _{1} X_{1} + \ldots + \beta _{n} X_{n} + \varepsilon _{t} $$

where Yt is the expected value at time t, X1, …, Xn are the predictor variables, the βj are the estimated coefficients, and εt is a random error term at time t.

Several previous studies have shown that linear regression is less accurate in forecasting compared to advanced methods (Risse, 2019) and suffers from various statistical restrictions, such as endogeneity and multicollinearity (Baker et al., 2020; Pesaran & Smith, 2019).
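
As an illustration, a minimal ordinary least squares baseline, reusing the hypothetical split sketched in Sect. 3, could read:

```python
# OLS estimation of Y_t = beta_0 + beta_1 X_1 + ... + beta_n X_n + eps_t;
# X_train, X_test, y_train, y_test come from the illustrative split in Sect. 3.
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(X_train, y_train)  # least-squares fit of the betas
y_pred_lr = lin_reg.predict(X_test)                 # out-of-sample forecasts
```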

4.1.2 Neural networks

Artificial neural networks are widely used in forecasting commodity prices (Ewees et al., 2020; Kristjanpoller & Minutolo, 2015, 2016). Neural networks (NN) are a set of formal neurons organized in layers and operating in parallel. In a network, each subgroup processes information independently from the others and transmits the result of its analysis to the next subgroup. The first layer, called the input layer, receives the source data that we want to analyze; its size is therefore directly determined by the number of input variables. The second layer is a hidden layer, in the sense that it has only an intrinsic utility for the network and no direct contact with the outside. The third layer, called the output layer, gives the result obtained after the network has processed the data entered in the first layer. Each neuron collects the information provided by the neurons of the previous layer and then calculates its activation potential. This potential is then transformed by an activation function to determine the pulse sent to the neurons of the next layer. The output of a hidden-layer neuron is calculated as follows:

$$ Y_{j} = \frac{1}{1 + e^{ - \left( \sum_{i = 1}^{N} w_{ij} x_{i} - b_{j} \right)}} $$

where wij is the weight connecting the ith input xi to hidden neuron j, and bj is the bias of neuron j. Multi-layer perceptrons are the most commonly used models. Several processing layers allow them to capture non-linear relationships between inputs and outputs.
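
A multi-layer perceptron with a logistic (sigmoid) hidden layer, as in the formula above, might be sketched as follows; the hidden-layer size is an illustrative assumption, not the paper's configuration:

```python
# A single hidden layer with a logistic activation, matching the sigmoid formula
# above; the width of 32 neurons is an illustrative assumption.
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(32,),  # one hidden layer (assumed size)
                   activation="logistic",     # 1 / (1 + exp(-z))
                   max_iter=5000, random_state=0)
mlp.fit(X_train, y_train)
y_pred_nn = mlp.predict(X_test)
```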

4.1.3 Random forest

Recently, several studies have shown the effectiveness of random forest regression (RF) in economics and finance (Krauss et al., 2017; Loureiro et al., 2018; Mercadier & Lardy, 2019). RF is a tree-based regression approach. According to Babar et al. (2020), RF has been extensively used in recent years due to its robust performance compared to other traditional models. It is an ensemble learning framework proposed by Breiman (2001), which is built on the aggregation of a multitude of regression trees. The forest is constructed by randomly sampling a feature subgroup for each decision tree, and, based on bootstrap sampling, each base tree is grown on its own random sample of the data. After the training process, the predicted values of the gold price are expressed as follows:

$$ Y_{t} = \frac{1}{T}\sum_{k = 1}^{T} l_{k} \left( x \right) $$

where lk(x) is the prediction of the kth random tree learner, T is the number of trees, and x is the vector of input variables. The trees are constructed using binary recursive partitioning.
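
A random forest along these lines, averaging T bootstrapped trees as in the formula above, could be sketched as follows (the number of trees is an illustrative assumption):

```python
# T bootstrapped regression trees whose predictions are averaged, as in the
# formula above; n_estimators = 500 is an illustrative choice.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,      # T trees
                           max_features="sqrt",   # random feature subgroup per split
                           bootstrap=True,        # each tree sees its own bootstrap sample
                           random_state=0)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
```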

4.1.4 Light gradient boosting machine (LightGBM)

LightGBM is a novel gradient boosting framework proposed by Ke et al. (2017). It employs gradient-based one-side sampling to select the split point by computing the variance gain. The LightGBM algorithm introduces two novel techniques: gradient-based one-side sampling and exclusive feature bundling (Sun et al., 2019). The estimated function of LightGBM integrates T regression trees and is defined as follows:

$$ Y_{t} = \sum_{k = 1}^{T} f_{k} \left( x \right) $$

where fk(x) denotes the kth regression tree. In LightGBM, Newton's method is used to approximate the objective function.

Several studies have shown that LightGBM provides more efficient and accurate performance than other advanced machine learning algorithms. According to Sun et al. (2019), the advantages of LightGBM are its fast training speed, low memory consumption and good model accuracy.
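
A corresponding LightGBM sketch, with illustrative rather than tuned hyper-parameter values, could read:

```python
# Additive ensemble of T regression trees fitted by gradient boosting with
# gradient-based one-side sampling; the settings shown are illustrative.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
lgbm.fit(X_train, y_train)
y_pred_lgbm = lgbm.predict(X_test)
```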

4.1.5 CatBoost algorithm

CatBoost is another new gradient boosting algorithm, proposed recently by Prokhorenkova et al. (2018). This supervised machine learning algorithm performs gradient boosting on decision trees and natively handles categorical features. Each decision tree is created by dividing the training data set into similar instances. According to Prokhorenkova et al. (2018), CatBoost uses ordered boosting and an innovative algorithm for processing categorical features, and it outperforms other boosting techniques in terms of performance. The function of decision tree h can be written as:

$$ h^{t} = \arg \min_{h} \frac{1}{N}\sum_{k = 1}^{N} \left( - f^{t} \left( X_{k} ,Y_{k} \right) - h\left( X_{k} \right) \right)^{2} $$

where Xk is the random vector of input variables for the kth of N observations, Yk is the outcome, and the function f t is a least-squares approximation obtained by Newton's method. Moreover, CatBoost uses oblivious decision trees in order to improve efficiency, enhance execution speed and reduce over-fitting.
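
A CatBoost sketch under the same illustrative assumptions:

```python
# Ordered boosting on oblivious (symmetric) decision trees; the iteration count,
# learning rate and depth are illustrative assumptions, not the tuned values.
from catboost import CatBoostRegressor

cat = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=5,
                        random_seed=0, verbose=0)
cat.fit(X_train, y_train)
y_pred_cat = cat.predict(X_test)
```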

4.1.6 XGBoost algorithm

Recently, XGBoost has been utilized in various disciplines, such as energy (De Clercq et al., 2020; Ma et al., 2020), health care (Guo et al., 2019; Singh et al., 2019), and credit scoring (Jiang et al., 2019; Xia et al., 2017). XGBoost, developed by Chen and Guestrin (2016), is an algorithm that builds on the boosting model proposed by Friedman (2001). Regularization is used in the objective function to reduce model complexity, to prevent overfitting and to make the learning process faster. Importantly, XGBoost is an ensemble model consisting of an efficient implementation of decision trees, designed to produce a combined model whose predictive performance is better than that of the individual techniques used alone. According to Mo et al. (2019), the output function is calculated as follows:

$$ \hat{Y}_{i}^{T} = \sum_{k = 1}^{T} f_{k} \left( x_{i} \right) = \hat{Y}_{i}^{T - 1} + f_{T} \left( x_{i} \right) $$

where \(\hat{Y}_{i}^{T - 1}\) is the prediction of the previously generated trees, \(f_{T} \left( x_{i} \right)\) is the newly created tree model, and T is the total number of tree models.

In XGBoost, several parameters need to be tuned to maximize model performance and to prevent overfitting. A cross-validation technique has been used to find the optimal combination of parameters; in our study, we use tenfold cross-validation for parameter tuning. The optimal hyper-parameter values selected after cross-validation are: column sample by tree: 0.7; learning rate: 0.05; number of iterations: 500; max depth: 5; subsample: 0.7; minimum sum of instance weight needed in a child: 4; number of parallel threads: 4; and silent: 1.
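
This configuration can be expressed with xgboost's scikit-learn interface along the following lines; note that in recent library versions "silent: 1" maps to verbosity=0 and the thread count to n_jobs, and that the cross-validation grid shown is a hypothetical one:

```python
# The reported hyper-parameters, written out with xgboost's scikit-learn API.
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

xgb_model = XGBRegressor(colsample_bytree=0.7, learning_rate=0.05,
                         n_estimators=500, max_depth=5, subsample=0.7,
                         min_child_weight=4, n_jobs=4, verbosity=0)

# Tenfold cross-validation for the tuning step; the grid itself is hypothetical.
grid = GridSearchCV(XGBRegressor(verbosity=0),
                    param_grid={"max_depth": [3, 5, 7],
                                "learning_rate": [0.01, 0.05, 0.1],
                                "subsample": [0.7, 1.0]},
                    scoring="neg_root_mean_squared_error", cv=10)
grid.fit(X_train, y_train)
xgb_model = grid.best_estimator_
y_pred_xgb = xgb_model.predict(X_test)
```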

4.2 Performance metrics of models

The forecasting performance of the six machine learning models is evaluated by four common evaluation metrics: the root mean square error (RMSE), the mean square error (MSE), the mean absolute error (MAE), and the coefficient of determination (R2). These metrics are defined as follows:

$$ RMSE = \sqrt{\frac{1}{N}\sum_{h = 1}^{N} \left( \hat{Y}_{h} - Y_{h} \right)^{2}} $$
$$ MSE = \frac{1}{N}\sum_{h = 1}^{N} \left( \hat{Y}_{h} - Y_{h} \right)^{2} $$
$$ MAE = \frac{1}{N}\sum_{h = 1}^{N} \left| \hat{Y}_{h} - Y_{h} \right| $$
$$ R^{2} = \frac{\sum_{h = 1}^{N} \left( \hat{Y}_{h} - \bar{Y} \right)^{2}}{\sum_{h = 1}^{N} \left( Y_{h} - \bar{Y} \right)^{2}} $$
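
These metrics can be computed directly for any of the fitted models, e.g. (note that r2_score implements the standard 1 − SSres/SStot form of R2):

```python
# Computing RMSE, MSE, MAE and R2 for a vector of test-set forecasts.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {"RMSE": np.sqrt(mse),
            "MSE": mse,
            "MAE": mean_absolute_error(y_true, y_pred),
            "R2": r2_score(y_true, y_pred)}  # 1 - SS_res / SS_tot

print(evaluate(y_test, y_pred_xgb))  # same pattern for the other five models
```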

4.3 SHAP (SHapley Additive exPlanations) approach for results interpretation

Machine learning has great potential in forecasting time series data, but researchers do not usually explain their predictions, which is a barrier to the adoption of machine learning. To overcome this problem, Lundberg and Lee (2017) proposed the SHAP approach for interpreting the predictions of different techniques, including LightGBM, NGBoost, CatBoost, XGBoost, and Scikit-learn tree models. SHAP helps users interpret the predictions of complex models. It builds on the Shapley value, introduced in game theory by Shapley (1953), and allows us to explain the prediction for a specific input x by calculating the contribution of each feature to that prediction. The estimated Shapley value is calculated as follows:

$$ \hat{\phi}_{j} = \frac{1}{K}\sum_{m = 1}^{K} \left( \hat{g}\left( x_{ + j}^{m} \right) - \hat{g}\left( x_{ - j}^{m} \right) \right) $$

where \(\hat{g}\left( x_{+j}^{m} \right)\) is the prediction for x with a random number of feature values replaced by values drawn from a random data point, except that the value of feature j is retained, and \(\hat{g}\left( x_{-j}^{m} \right)\) is the analogous prediction in which the value of feature j is also replaced.

Lundberg et al. (2018) proposed TreeSHAP for gradient boosting models, among them XGBoost. TreeSHAP offers rich visualizations of individual feature attributions that improve over classic feature importance and partial dependence plots. According to Lundberg et al. (2018), the SHAP interaction values can be estimated as follows:

$$ \phi_{i,j} = \sum_{S \subseteq N\backslash \left\{ i,j \right\}} \frac{\left| S \right|!\left( M - \left| S \right| - 2 \right)!}{2\left( M - 1 \right)!}\,\delta_{ij} \left( S \right) $$

where, for i ≠ j, \(\delta_{ij} \left( S \right) = f_{x} \left( S \cup \left\{ i,j \right\} \right) - f_{x} \left( S \cup \left\{ i \right\} \right) - f_{x} \left( S \cup \left\{ j \right\} \right) + f_{x} \left( S \right)\), M is the number of features, and S ranges over all feature subsets excluding i and j. SHAP values advance our understanding of tree models by providing feature importance measures, feature dependence plots, local explanations and summary plots.
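
In practice, the shap library computes these quantities directly from a fitted tree ensemble; a minimal sketch for the XGBoost model sketched above:

```python
# TreeSHAP on the fitted XGBoost model: per-observation SHAP values (phi_j) and
# the pairwise interaction values (phi_ij) defined above.
import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_train)                # shape (n_obs, n_features)
interactions = explainer.shap_interaction_values(X_train)   # shape (n_obs, M, M)

shap.summary_plot(shap_values, X_train)  # global importance view (cf. Fig. 2)
```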

5 Results analysis

5.1 Comparison of models’ performance

We have estimated model performance using the root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and the coefficient of determination (R2). To compare the six machine learning models, a validation test sample (20%) has been used. The findings confirm the high capacity of the explanatory variables to predict the gold price, with R2 ranging from 0.807 for linear regression to 0.994 for XGBoost, as shown in Table 4.

Table 4 Machine learning models performance on testing dataset

Table 4 presents a comparison of the predictive capacity of the six models. The model with the lowest values of RMSE, MSE and MAE, and the highest value of R2 is considered the best forecasting model. As depicted in Table 4, XGBoost provides the highest R2 value coupled with the lowest RMSE, MSE and MAE among all the models used in this study. XGBoost is followed by CatBoost and RF, whereas linear regression and neural networks lead to the worst results. This indicates the advantage of XGBoost over traditional forecasting techniques for time series data. To assess the predictive power of our analysis, we have also illustrated the performance of the different models in Fig. 1. This figure shows that the gold price forecasted using XGBoost is extremely close to the actual values for the test data. These findings are in line with the results of Xia et al. (2017) and Climent et al. (2019), who reported the superior performance of XGBoost over traditional models in credit scoring and bankruptcy prediction. Nevertheless, the poor results of the neural networks could be explained by the small size of the sample: according to Lago et al. (2018), deep learning models require large amounts of data to be trained correctly.

Fig. 1
figure 1

Performances of six machine learning models over test sample

5.2 Feature analysis

SHAP allows interpreting the influence of the input variables on the output. It helps policymakers interpret machine learning models by quantifying the importance of each variable.

Figure 2 displays the SHAP summary plot, which orders variables by their importance for the gold price. We can see that the silver price is the most important feature in the model. This result supports the findings of Schweikert (2018), who reported a strong long-run relationship between gold and silver prices. Additionally, higher values of the silver price result in higher SHAP values, which correspond to a higher probability that the gold price increases. Inflation is the next most important feature, and higher values of this variable also correspond to a higher chance that the gold price increases. This is in line with the results of Kristjanpoller and Minutolo (2015), who documented that inflation is correlated with gold price fluctuations. In contrast, lower values of iron ore correspond to a higher chance of gold price increases. This finding contrasts with Alameer et al. (2019), who found a positive relationship between iron ore and gold prices. The relationship between the gold price and the crude oil price is particularly evident: higher crude oil prices result in higher predicted gold prices. Again, this finding is consistent with the results of Kanjilal and Ghosh (2017) and Singhal et al. (2019), who documented that the crude oil price is a major macroeconomic determinant of gold price movements. Finally, the SHAP variable importance in Fig. 2 indicates that high values of the S&P 500 correspond to low predicted gold prices. This result is consistent with the findings of Piñeiro-Chousa et al. (2018), who found that gold returns and the S&P 500 index are negatively related.

Fig. 2
figure 2

On the left, SHAP summary plot of the XGBoost model: the higher the SHAP value of a feature, the higher the gold price level. On the right, the relative importance of each feature, obtained by taking the average absolute value of the SHAP values

To further examine the relationship between the features and the outcome, SHAP dependence plots show how a variable's value impacts the prediction (y-axis) for every observation in the dataset. We display SHAP dependence plots in Fig. 3. SHAP dependence plots can depict both the main effect of individual predictor variables and the interactions between them. Through global interpretability, we can see, over the whole sample, the positive or negative contribution of each feature to the prediction score. For example, in Fig. 3-d we can examine the impact of China's exchange rate as the crude oil price increases from 20 to 80. The red points display higher values of the crude oil price and the blue points lower ones, revealing that increasing China's exchange rate increases the volatility of the gold price. When China's exchange rate is low, the SHAP values for high crude oil prices are above zero, which suggests that increasing China's exchange rate increases the gold price. In contrast, the SHAP values for low crude oil prices are below zero, which indicates that increasing China's exchange rate while the crude oil price is low reduces the chance of an increase in the gold price.

Fig. 3
figure 3

SHAP dependence plots. The x-axis is the value of the feature and the y-axis is the SHAP value. The red points represent high values of the variable, whereas the blue points signify low values
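
A dependence plot in the spirit of Fig. 3-d can be produced as follows; this sketch uses the hypothetical column names introduced in Sect. 3:

```python
# SHAP dependence plot: main effect of crude oil on the predicted gold price,
# coloured by China's exchange rate to reveal the interaction discussed above.
shap.dependence_plot("Crude_Oil", shap_values, X_train,
                     interaction_index="USD_CNY")
```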

Figure 3-g shows the effect of the S&P 500 and inflation on the movements of the gold price. The SHAP values indicate that the impact of the S&P 500 starts positively; that is, increasing the S&P 500 when it is below 30 results in a higher gold price. Beyond that point, the linkage becomes negative, and an increase in the S&P 500 results in a lower gold price. In addition, jointly high values of the S&P 500 and inflation indicate that the gold price tends to be higher.

Let us now see how SHAP can provide local knowledge. Through local interpretability, we can measure how the feature values contribute to the prediction score of each observation in the sample separately. Traditional attribute importance algorithms, like linear models, only give a global importance value over the whole dataset, while SHAP provides an importance value for each observation separately. Figure 4 shows the marginal effect of seven features on the gold price level predicted by the XGBoost model in the training dataset. In Fig. 4, red indicates that a feature increased the gold price, while blue shows that a feature decreased it. We have chosen the first observation predicted by XGBoost for illustration. The predicted gold price is 834.32. China's exchange rate (7.019), inflation (257.9) and the S&P 500 (3,141) are located in the blue zone, indicating that these variables drive the gold price towards lower values, whereas silver (17.11), iron ore (0.145) and crude oil (57) are located in the red zone, pushing the gold price towards higher values.

Fig. 4
figure 4

Explanation of the first prediction generated by the XGBoost model using tree SHAP
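
The local explanation in Fig. 4 corresponds to a SHAP force plot for a single observation; a minimal sketch:

```python
# Force plot for the first training observation (cf. Fig. 4): red bars push the
# prediction above the base value, blue bars push it below.
shap.force_plot(explainer.expected_value, shap_values[0, :], X_train.iloc[0, :],
                matplotlib=True)
```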

6 Conclusion

In this paper, we compare six different machine learning models to determine which one is the most suitable for predicting the gold price. The findings show that XGBoost provides the best results among the competing techniques and outperforms all well-known benchmark models. Furthermore, the results reveal significant correlations between the gold price and all the predictor variables considered, i.e. crude oil, iron and silver prices, USD_EUR and USD_CNY exchange rates, the S&P 500 and the US inflation rate. This demonstrates that these variables display a high capacity to forecast future gold price fluctuations.

Moreover, this study applies the SHAP method, which unifies the field of interpretable machine learning. Indeed, the technique provides a rich visualization of individualized feature attributions that improves the interpretability of gold price fluctuations. TreeSHAP advances our understanding of tree models: it offers an insightful means to interpret the findings of a complex framework such as XGBoost and to extract the nonlinear effects of the features on the output of a model. We show how Shapley additive explanations can be used to interpret the outputs of an XGBoost model designed to predict the gold price.

As practical implications, our results offer some meaningful insights for investors and policy makers. First, the choice of an accurate technique provides an effective forecasting tool for central banks and investors. Central banks need to know gold price movements in order to secure certain transactions or to build strategic reserves. For investors, gold helps to diversify the portfolio and serves as insurance against risk and volatility; in other words, gold can be used as an investors' safe haven. If the direction of the gold market is successfully predicted, investors may be better guided and earn a safer return. Second, our findings suggest that traders may benefit from the XGBoost algorithm and SHAP interaction values in their decisions. A successful forecasting procedure empowers traders to make decisions and plan for the future in order to enhance favorable scenarios. Third, our study will benefit policy makers, as it provides a list of factors, including oil, that serve as indicators for the gold price. The SHAP method offers a powerful and insightful measure of the importance of every input variable in the prediction of future gold price fluctuations.

Our study presents a limitation common to all similar studies. Although the proposed model can achieve very accurate predictions, it should be acknowledged that markets depend on a number of variables, such as geopolitical decisions, that can result in unpredictable movements.

Finally, future research might extend our work by considering additional variables, such as political or commercial factors as well as phases of economic instability, which are generally determining factors of the price of gold. Another direction for future research is the application of the proposed model to forecasting other commodity prices. Moreover, it would be interesting to include one or more computational cost factors in the comparison of the different forecasting models; a mathematical formulation based on an operational research process would allow a more objective comparison. It would also be promising to equip some of the approaches presented above with a preprocessing stage to provide a hybrid adaptive approach (Saâdaoui, 2012). Developing multimodal extensions by first proceeding with an unsupervised clustering could also lead to approaches robust enough to better capture the outliers present in the data (Saâdaoui, 2012). Metaheuristic optimization methods could therefore be useful to identify and estimate this type of method (Rabbouch et al., 2020).