
1 Introduction

In financial investing, the general goal is to dynamically allocate a set of assets so as to maximize returns over time while simultaneously minimizing risk. A well-known financial trading strategy is statistical arbitrage, or StatArb for short, which evolved out of the pairs trading strategy [15], where stocks are paired based on fundamental or market similarities [20]. In intra-day pairs trading, when one stock of the pair outperforms the other, that stock is sold short with the expectation that its price will have dropped when the positions are closed. Similarly, the under-performer is bought with the expectation that its price will have climbed when the positions are closed. The same concept applies to the StatArb strategy, except that it extends to the portfolio level with more stocks [34]. Furthermore, the portfolio construction is automated and comprises two phases: (i) the scoring phase, where each stock is assigned a relevance score, with high scores indicating stocks that should be held long and low scores indicating candidates for short operations; and (ii) the risk reduction phase, where the stocks are combined to eliminate, or at least significantly reduce, the risk factor [4, 28].

Financial investors that use the StatArb strategy face the important challenge of correctly identifying pairs of assets that exhibit similar behaviour, as well as determining the point in time when such assets' prices start moving away from each other. As such, researchers have devoted considerable effort to investigating novel approaches to the asset choice problem and have developed a wide range of statistical tools for the purpose: distance-based methods [20], the co-integration approach [42], and models based on stochastic spread [26]. As previously noted in the literature [23], these tools exhibit a drawback in that they rely solely on the statistical relationship of a pair at the price level and lack a forecasting component. Moreover, if a divergence between the stocks in a pair is observed, it is assumed that the prices must converge in the future, and positions are closed only when the equilibrium is reached, an event whose timing cannot be accurately determined.

At the same time, the rapid growth of market integration has yielded massive amounts of data in the finance industry, which promotes the study of advanced data analysis tools. By the same token, since StatArb is performed at the portfolio level (hence a large number of assets is involved), the strategy needs to be implemented in an automated fashion. As such, the use of cutting-edge analytical techniques and machine learning algorithms has grown [22]. However, incorporating machine learning algorithms comes with its own set of drawbacks, as financial data contains a large amount of noise, jumps, and movements, leading to highly non-stationary time series that are thought to be highly unpredictable [35], thus deteriorating forecasting performance. Ensemble methods have already proven to be a successful way to mitigate the noise present in the data. In the literature, they have demonstrated superior predictive performance compared to individual forecasting algorithms, hence their notable success in different domains such as credit scoring [11] or sentiment analysis [3, 37, 38]. Furthermore, heterogeneous ensembles have been shown to outperform homogeneous ones for forecasting [9, 31]. Forecasting itself can be framed as two different tasks: classification and regression. Several implementations of StatArb in the literature use classification [30, 40], which has proven easier to solve than regression [39]. Although regression in the context of financial predictions poses more challenges [18, 33], it allows for a more granular ranking, without reference to any balance point. As such, in this paper we propose a general approach for risk-controlled trading based on machine learning and StatArb. The approach employs an ensemble of regressors and provides three levels of heterogeneity:

  1.

    Its components consist of any number of state-of-the-art machine learning and statistical models.

  2.

    We train our models on a diversified feature set built from the financial time series of each constituent, considering not only lagged daily price returns but also a series of technical indicators.

  3.

    We consider diversified models, trained either on data from individual companies or on data from companies belonging to the same industry.

Finally, in our approach, after the assets have been ranked in descending order, we propose the use of a dynamic asset selection step, which looks at the past and adjusts the ranking by removing stocks with poor past behavior. Then, the strategy buys (performing long operations) the top k stocks and sells (performing short operations) the flop k stocks, consistently with the trading step described in Sect. 4.

In this paper, we also propose one possible instance of our approach, configured for intra-day operations on the well-known S&P500 Index. The regressors we have employed for this instance are the following state-of-the-art machine learning algorithms: Random Forests (RF), Light Gradient Boosting (LGB), and Support Vector Regressors (SVR), together with the widely known statistical model ARIMA. ARIMA models are known to be robust and efficient for short-term prediction when employed to model economic and financial time series [1, 17], even more so than the most popular ANN techniques [32, 36].

To validate the configuration we have chosen for our instance, we evaluate its performance from both the return and the risk perspectives. The comparisons against the Buy-and-Hold strategy on the S&P500 Index and against the individual regressors adopted in our instance clearly illustrate its superior forecasting performance.

In summary, the contributions of this paper are the following:

  1.

    We propose a general approach for risk-controlled trading based on machine learning and StatArb.

  2.

    We define the problem as a regression over price returns, rather than as a classification task.

  3.

    Our approach can be easily implemented using different types of assets.

  4.

    We propose an ensemble methodology for StatArb, tackling the ensemble construction from three different perspectives:

    • model diversity, by using both machine learning algorithms and statistical models;

    • data diversity, by considering lagged price returns and technical indicators so as to enrich the data used by the models;

    • method diversity, by training single models across several assets (i.e., models per industry) and, conversely, models for each individual stock.

  5.

    We develop a dynamic asset selection based on the models' most recent prediction performance, which keeps an asset in the ranking only if its past return-trend predictions exceed a pre-determined accuracy level.

  6.

    We provide a possible instance of our approach for intra-day StatArb trading within the S&P500 Index, with four kinds of regressors (three machine learning algorithms and a statistical model).

  7.

    We carried out a performance evaluation of our instance, showing that it outperforms baseline methods on the S&P500 Index for intra-day trading.

The remainder of this paper is organized as follows. Section 2 briefly describes relevant related work in the literature. Section 3 introduces the problem we are facing, whereas Sect. 4 presents the architecture of the proposed general approach and the instance we have generated. All the features we have used are described in Sect. 5. Section 6 details the regressors we have considered in the ensemble of our instance. Section 7 describes the proposed ensemble methodology and how we aggregate the results of its single components. The dynamic asset selection approach is illustrated in Sect. 8. Section 9 discusses the experiments we have carried out, and Sect. 10 concludes the paper.

2 Related Work

The literature dealing with applications of machine learning and neural networks in finance is presented and analyzed in several works [2, 10, 12, 22]. The work in [23] proposes a StatArb system that entails three phases: forecasting, ranking, and trading. For the forecasting phase, the authors propose the use of an Elman recurrent neural network to perform weekly predictions and anticipate return spreads between any two securities in the portfolio. Next, a multi-criteria decision-making method is used to outrank stocks based on their weekly predictions. Lastly, trading signals are generated for the top k and bottom k stocks. This approach considers constituents of the S&P100 Index over a period spanning from 1992 to 2006. Although this approach also considers regression, it lacks scalability, as its application is limited to 100 stocks; in the case of broader indexes such as the S&P500 or Russell 1000, it would become computationally intractable. In [40], deep neural networks were used and standardized cumulative returns were considered as features. Following the approach proposed by [40], in [30] the authors construct a similar classification problem using cumulative returns as input features and employ models such as deep neural networks, random forests, gradient boosted trees, and three of their ensembles. The authors validate their study using S&P500 Index constituents over a period ranging from 1992 to 2015, with a trading frequency of one day. Later, the authors extend their work in [19] by using a Long Short-Term Memory network for the same prediction task. This enhanced approach outperforms memory-free classification methods. However, as the authors note, the out-performance is registered from 1992 to 2009, whereas from 2010 onward the excess return fluctuates around zero. The ensemble proposed in that work is used to tackle a classification problem, whereas ours aims at solving a more difficult regression problem. In [29], the authors take a different approach for predicting returns of the S&P500, where the features used are stock tweet information. The aim is to unveil how textual data is reflected in stocks' future returns. For this goal, they use factorization machines and support vector machines. The proposed system performs predictions at a 20-minute frequency over a two-year period, from January 2014 to December 2015. The selection of flop and top stocks is made in the formation period based on the algorithms' performance (i.e., lowest root relative squared error), and trading signals are generated based on Bollinger bands. The authors state that their factorization machines approach yields positive results even after transaction costs. In contrast to the previously presented studies, in this work we consider the trading performance of an ensemble of diversified regression techniques that combines diverse models and data. Additionally, our approach includes in the pipeline a dynamic asset selection within the risk reduction phase, in order to avoid stocks whose poor past performance would jeopardize future trading. Such a heterogeneous setup is important to deal with the uncertain behavior of the market, as richer models and complementary information are used in the process. Moreover, the proposed approach can be regarded as generic, as it can be instantiated with a large number of configurations: number and types of regressors, market type (e.g., intra-day), selected features (e.g., lagged returns, technical indicators), and number of assets to buy or sell (choice of k).

3 Problem Formulation

The problem tackled by our general approach consists of an algorithmic trading task in the context of StatArb that leverages machine learning to identify possible sources of profit and balance risk at the same time. The StatArb technique consists of three steps: forecasting, ranking, and trading.

  • Forecasting - We tackle StatArb as a regression problem, investigating the potential of forecasting price returns for each of the assets in a pre-selected asset collection S, on a target trading day d.

  • Ranking - Based on the anticipated price returns for the assets, we rank them in descending order. We balance the risk incurred by inaccurate predictions by pruning "bad" assets based on their past behavior. This dynamic asset pruning yields a reorganized ranking of the assets.

  • Trading - Given the trading desirability provided by the ranking in the previous stage, we issue trading signals for the top k and flop k stocks.
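
To make the three steps concrete, the following minimal sketch (with hypothetical `predict_return` and `passes_pruning` callables standing in for the forecasting and pruning components detailed in the next sections) shows how one trading day could be processed:

```python
from typing import Callable, Dict, List

def trade_day(
    assets: List[str],
    predict_return: Callable[[str], float],   # forecasting: any regressor or ensemble
    passes_pruning: Callable[[str], bool],    # dynamic asset selection on past behavior
    k: int = 5,
) -> Dict[str, List[str]]:
    """One StatArb iteration: forecast, rank with pruning, and emit trading signals."""
    # Forecasting: predicted price return for each asset on the target day d.
    forecasts = {s: predict_return(s) for s in assets}
    # Ranking: keep only assets that pass the pruning check, then sort by
    # predicted return in descending order.
    ranked = sorted(
        (s for s in assets if passes_pruning(s)),
        key=lambda s: forecasts[s],
        reverse=True,
    )
    # Trading: long signals for the top k, short signals for the flop k.
    return {"long": ranked[:k], "short": ranked[-k:]}
```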

4 The Proposed Approach

Fig. 1. Architecture of the proposed general approach for risk-controlled trading

Fig. 2. Illustration of the walk-forward procedure

Figure 1 depicts the architecture of the general approach for risk-controlled trading we propose in this paper. Once the set of assets to work with has been selected, we first collect raw financial information for each asset \(s_i\) in the pre-selected asset collection S. We split our raw data into study periods, composed of non-overlapping training (in-sample data, used for training models) and trading (test) sets. This is a well-known validation procedure for time-series data-sets [16], known as the walk-forward strategy. Figure 2 illustrates such a procedure. For each study period and each asset \(s_i\), we generate the diversified feature set denoted by \(\mathcal {F}_{d-1}^{s_i}\), using information available prior to the target date d. For the in-sample period we also generate the label \(y_{d}^{s_i}\). The feature set is used as input to each regressor m in our regressor pool \(\mathcal {M}\). The forecast is then performed on test data, where each trained model makes its prediction \(o_{d}^{s_i,m}\) for day d and stock \(s_i\). Their results are then averaged by a given ensemble method, to obtain a final output \(o_{d}^{s_i,ENS}=\frac{\sum \limits _{m\in \mathcal {M}}{o_{d}^{s_i,m}}}{n(\mathcal {M})}\). Next, we sort the assets in descending order, so that at the top we find assets whose prices are expected to increase, and at the bottom assets whose prices are expected to drop. Assets at the top and at the bottom of this sorting represent the most suitable candidates for trading. After the ranking is performed, we introduce the dynamic asset selection step: from this pool of assets, we discard those that do not achieve a prediction accuracy higher than a given threshold \(\varepsilon \) over a past trading period, rearranging the ranking accordingly. The next step consists of selecting the top k (winners) and flop k (losers) assets and issuing the corresponding trading signals: k long signals for the top k stocks and k short signals for the bottom k stocks. These selections are repeated for every day d in the trading period. Finally, we evaluate the performance of our architecture by means of a back-testing strategy [4].

As mentioned in the introduction, we have instantiated one example of our general approach by using as the pool of assets the stocks within the S&P500 Index [19, 30] and by setting the trading session to intra-day. Also, we fixed the number of stocks to be traded on each side to \(k=5\), based on the findings of similar works [19, 30], where higher k values lead to a decrease in portfolio performance both in terms of returns and risk. The set of features \(\mathcal {F}\) and the regressors are described, respectively, in the next two sections.
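
The two mechanical ingredients of this pipeline, the walk-forward splitting and the simple-average ensemble output, could be sketched as follows (window lengths and model outputs are placeholders, not the exact values used in our experiments):

```python
import numpy as np

def walk_forward_splits(n_days: int, train_len: int, test_len: int):
    """Yield non-overlapping (train_idx, test_idx) index windows over a daily series."""
    start = 0
    while start + train_len + test_len <= n_days:
        yield (np.arange(start, start + train_len),
               np.arange(start + train_len, start + train_len + test_len))
        start += test_len  # roll the study period forward by one trading period

def ensemble_forecast(predictions: dict) -> float:
    """Simple average of the per-model forecasts o_d^{s_i,m} for one asset and day."""
    return float(np.mean(list(predictions.values())))

# Example: averaging the four regressors' forecasts for one stock on day d.
o_ens = ensemble_forecast({"RF": 0.004, "LGB": 0.006, "SVR": 0.003, "ARIMA": 0.005})
```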

5 Feature Engineering

As already mentioned, our dataset of reference for the instance we propose is the S&P500 Index. Therefore, we have collected information for all the stocks that have been listed, at least once, as constituents of the index in the period from January 2003 to January 2016.

For each stock, daily raw financial information is available: the Open Price, the High and Low Prices of the day, the Close Price, and the Volume of stocks traded during the day. Based on this information, we have created two different kinds of features:

  i.

    Lagged daily price returns (LR): historical price returns are the features most commonly used in financial studies. For a given trading day d and a lag \(\Delta d\), we compute \(LR_{d,\Delta d}\) as follows:

    $$\begin{aligned} LR_{d,\Delta d}=\frac{closePrice_{d-\Delta d}-openPrice_{d-\Delta d}}{openPrice_{d-\Delta d}}. \end{aligned}$$
    (1)

    We have set \(\Delta d \in \{1,\dotsc ,10\}\), thus having for each day d ten different lagged price returns, shown as follows:

    $$\begin{aligned} {\begin{matrix} [LR_{d-10}^{s_i}, LR_{d-9}^{s_i}, LR_{d-8}^{s_i}, LR_{d-7}^{s_i}, LR_{d-6}^{s_i}, LR_{d-5}^{s_i}, LR_{d-4}^{s_i}, LR_{d-3}^{s_i}, LR_{d-2}^{s_i}, LR_{d-1}^{s_i}] \end{matrix}} \end{aligned}$$

    The target value associated with this feature vector is the intra-day price return for day d.

  ii.

    Technical Indicators (TI): following [25], we use the set of technical indicators summarized in Table 1. We opted for this set of features as we are interested in predicting both the range and the direction of the price movement. Each technical indicator provides different insights into the stock price movement.

For this second type of feature we built the following vector:

$$\begin{aligned} {\begin{matrix} [EMA(10), \%K, ROC, RSI, AccDO, MACD, \%R, Disp(5), Disp(10)] \end{matrix}} \end{aligned}$$

As for the LR feature vector, the associated target value (label) is the intra-day price return of the current day; a sketch of how both feature types could be computed is given after Table 1.

Table 1. Selected technical indicators and their acronyms throughout this paper.
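
As an illustration (referenced above), the sketch below shows how the two feature types could be derived with pandas. Column names, indicator windows, and the exact indicator formulas are assumptions of this sketch rather than the paper's specification:

```python
import pandas as pd

def make_features(df: pd.DataFrame, n_lags: int = 10) -> pd.DataFrame:
    """Build the LR lags and a few technical indicators from daily OHLC data.

    Assumes columns named 'open' and 'close'; indicator formulas follow
    common textbook definitions and may differ in detail from Table 1.
    """
    out = pd.DataFrame(index=df.index)
    intraday_ret = (df["close"] - df["open"]) / df["open"]   # Eq. (1) for a single day
    for lag in range(1, n_lags + 1):                         # LR_{d-10}, ..., LR_{d-1}
        out[f"LR_{lag}"] = intraday_ret.shift(lag)
    out["EMA_10"] = df["close"].ewm(span=10, adjust=False).mean()
    out["ROC"] = df["close"].pct_change(periods=10)          # rate of change
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["RSI"] = 100 - 100 / (1 + gain / loss)               # relative strength index
    out["target"] = intraday_ret                             # intra-day return label for day d
    return out
```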

6 Baselines

In the proposed instance of our general approach, we considered the following three state-of-the-art machine learning models, together with the widely known statistical model ARIMA. We based our choice of models on the following criteria: (i) robustness to noisy data and over-fitting; (ii) diversity amongst the models in the final ensemble; and (iii) adoption of such models by the scientific community for similar tasks.

Light Gradient Boosting (LGB) is a relatively new Gradient Boosting Decision Tree algorithm, proposed in [27], which has been successfully employed in multiple tasks, not only classification and regression but also ranking. LGB iteratively applies weak learners (decision trees) to re-weighted versions of the training data [21]. After each boosting iteration, the prediction results are evaluated according to a decision function and data samples are re-weighted in order to focus on examples with higher loss in previous steps. This method grows trees with a leaf-wise (best-first) strategy until the maximum depth is reached, which makes the algorithm more prone to over-fitting. To control this behavior, we set the maximum depth of the trees, max_depth, to 8. We chose to vary the num_leaves parameter in the set [70, 80, 100], achieving a balance between a conservative model and good generalization. Feature selection is restricted by the colsample_bytree parameter, set to 0.8 of the total number of features, which can be thought of as a regularization parameter. The work in [21] suggests a learning rate lower than 0.1, so we set it to 0.01 to favor better generalization over the data set.
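
A possible scikit-learn-style configuration matching these settings is sketched below; the number of boosting rounds (n_estimators) is an assumption, as it is not stated in the text:

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

# Search space as described above; n_estimators is an assumption.
lgb_grid = {"num_leaves": [70, 80, 100]}
lgb = LGBMRegressor(max_depth=8, learning_rate=0.01,
                    colsample_bytree=0.8, n_estimators=500)
lgb_search = GridSearchCV(lgb, lgb_grid, cv=10, scoring="neg_mean_squared_error")
# lgb_search.fit(X_dev, y_dev) would run the inner 10-fold cross-validation of Sect. 7.
```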

Random Forests (RF) belong to a category of ensemble learning algorithms introduced in [8]. This learning method extends traditional decision tree techniques: a random forest is composed of many deep, de-correlated decision trees. The de-correlation is achieved through bagging and random feature selection, two techniques that make the algorithm robust to noise and outliers. In the case of RF, the larger the size of the forest (the number of trees), the better the convergence of the generalization error. However, a higher number of trees, or deeper trees, increases the computational cost, so a trade-off must be made between the number of trees in the forest and the improvement in learning after each tree is added. We opted to vary the number of trees by ranging n_estimators from 50 to 500 with an increment of 25, basing our choice on the work of [24]. Since random feature selection already substantially reduces the correlation among trees, we set min_samples_leaf, the minimum number of samples required at a leaf node, to 3.
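
The corresponding search could be set up as follows; the parallelism and random seed are assumptions added only for reproducibility of the sketch:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_grid = {"n_estimators": list(range(50, 501, 25))}   # 50 to 500 trees, step 25
rf = RandomForestRegressor(min_samples_leaf=3, n_jobs=-1, random_state=0)
rf_search = GridSearchCV(rf, rf_grid, cv=10, scoring="neg_mean_squared_error")
```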

Support Vector Regressors (SVR) were initially proposed as a supervised learning model for classification and later revised for regression in [41]. Given the training data, the goal is to find a function that deviates from the actual targets by no more than \(\varepsilon \) for each training point while being as flat as possible. It extends least-squares regression by considering an \(\varepsilon \)-insensitive loss function, and regularization is usually applied to avoid over-fitting the training data. An SVR thus solves an optimization problem that involves two parameters: the regularization parameter (referred to as C) and the error sensitivity parameter (referred to as \(\varepsilon \)). C, the regularization cost, controls the trade-off between model complexity and the number of non-separable samples: a lower C encourages a larger margin, whereas higher C values lead to a hard margin [41]. We set its search space to \(\{8, 10, 12\}\). The parameter \(\varepsilon \) controls the width of the \(\varepsilon \)-insensitive zone used to fit the training data: too high a value leads to flat estimates, whereas too small a value is not appropriate for large or noisy data-sets. We therefore set it to 0.1. In this study, we selected the radial basis function (RBF) kernel. The work in [13] suggests that the \(\gamma \) value of the kernel function should vary together with C, with higher values of C requiring higher values of \(\gamma \); we therefore use a small search space of \(\{0.01, 0.5\}\).
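
Expressed with scikit-learn, the SVR search space described above would look roughly like the following sketch (feature scaling, which SVRs usually require, is omitted for brevity):

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

svr_grid = {"C": [8, 10, 12], "gamma": [0.01, 0.5]}
svr = SVR(kernel="rbf", epsilon=0.1)
svr_search = GridSearchCV(svr, svr_grid, cv=10, scoring="neg_mean_squared_error")
```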

The ARIMA model was first introduced in [7] and has ever since been one of the most popular statistical methods for time-series forecasting. The algorithm captures a suite of different time-dependent structures in a time series. As its acronym indicates, ARIMA(p, d, q) comprises three parts: an autoregressive model that uses the dependencies between an observation and a number of lagged observations (p); integration, i.e., differencing of the observations to a given degree to make the time series stationary (d); and a moving average model that accounts for the dependency between observations and the residual error terms of a moving average applied to lagged observations (q). We chose the lag order \(p \in \{1,5\}\), the degree of differencing \(d\in \{1,5\}\), and the size of the moving average window \( q\in \{0,5\}\).
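
A sketch of how this small grid could be searched with statsmodels is given below; selecting by AIC here is an assumption made for illustration, whereas the actual model selection in Sect. 7 is based on validation MSE:

```python
from itertools import product
from statsmodels.tsa.arima.model import ARIMA

def best_arima(series, p_values=(1, 5), d_values=(1, 5), q_values=(0, 5)):
    """Fit ARIMA(p, d, q) over the small grid above; keep the lowest-AIC fit."""
    best_fit, best_aic = None, float("inf")
    for p, d, q in product(p_values, d_values, q_values):
        try:
            fit = ARIMA(series, order=(p, d, q)).fit()
        except Exception:
            continue                      # some orders may fail to converge
        if fit.aic < best_aic:
            best_fit, best_aic = fit, fit.aic
    return best_fit
```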

7 Ensemble

In the previous section, we described the regressors included in the ensemble of the instance proposed in this paper, along with the parameter space used for each of them. Besides the features mentioned in Sect. 5 and the parameters intrinsic to each forecasting model mentioned in Sect. 6, we also considered:

  • a model for each stock \(s_{i} \in S\) in the training period;

  • a model for each industry, obtained by grouping stocks by their industry sector as given by the Global Industry Classification Standard (GICS).

The latter was encouraged by previous work [20], where some portfolios were restricted to only include stocks from the same industry; moreover, companies in the same industry usually tend to have similar behavior and exhibit some correlation in their stock price movements. As such, our training and model selection procedure is composed of three steps. As illustrated in Fig. 2, for each walk and each asset (stock):

  • We split the training portion of the data-set into development and validation sets;

  • Each type of model is trained on the development subset. For the training of each regressor, we used an inner cross-validation with 10 folds to find the optimal hyper-parameters. Consequently, to forecast the return of each asset, we created four models: two models (per industry), one using TI and one using LR as features, which use the data of all assets associated with that industry and, in turn, forecast one asset at a time; and two models (per asset), again one using TI and one using LR, which use the data of a single asset only. Then, using the validation set, we compute the MSE between the forecast and the ground truth and choose the best of the four models, per asset, for that walk;

  • Finally, the best model found at the previous step is trained on the full training set and tested on the test set.

During each walk and for each stock, the LGB, RF, SVR, and ARIMA predictions are averaged to obtain the ensemble forecast.
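
A condensed sketch of the per-asset, per-walk model selection step is given below; labels and data structures are illustrative, and the winning model is afterwards re-fitted on the full training window as described above:

```python
from sklearn.metrics import mean_squared_error

def select_best_model(candidates, validation_sets):
    """Return the label of the candidate with the lowest validation MSE.

    `candidates` maps a label (e.g. 'industry-TI', 'industry-LR', 'asset-TI',
    'asset-LR') to an estimator already fitted on the development subset;
    `validation_sets` maps the same labels to (X_val, y_val) pairs.
    """
    scores = {}
    for name, est in candidates.items():
        X_val, y_val = validation_sets[name]
        scores[name] = mean_squared_error(y_val, est.predict(X_val))
    return min(scores, key=scores.get)
```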

8 Dynamic Asset Selection

We propose a stock pruning mechanism based on a dynamic asset selection strategy. For a stock \(s_i \in S\), given its past forecasts \(o_{t}^{s_{i},ENS}\) and the corresponding realized values \(y_t^{s_i}\) over a predefined look-back period T, we compute a modified version of the mean directional accuracy [5, 6] as follows:

$$\begin{aligned} MDA_{s_i,T,d}=\frac{1}{T}\sum _{t=d-T}^{d-1}\mathbf {1}_{sgn(o_{t}^{s_i,ENS})=sgn(y_{t}^{s_i})} \text {,} \end{aligned}$$
(2)

where d is the current trading day, T is the look-back length, \(\mathbf {1}_{P}\) is the indicator function that converts any logical proposition P into a number that is 1 if the proposition is satisfied and 0 otherwise, and \(sgn(\cdot )\) is the sign function. The \(MDA_{s_i,T,d}\) metric compares the forecasted direction (upward or downward) with the realized direction, providing the probability that the forecasting model detects the correct direction of returns for a stock \(s_i\) over a given timespan T prior to day d. This component introduces a new step in the StatArb pipeline: after the forecast is done, we rank the companies by their forecasted daily price returns. From this pool of companies, we discard those that do not achieve a prediction accuracy higher than a given threshold \(\varepsilon \) over the past trading period, rearranging the ranking accordingly. The proposed dynamic asset selection strategy requires two parameters: the accuracy threshold \(\varepsilon \) and the length T of the rolling window over the past trading period. We chose the look-back length based on the findings in [14], where the authors observed that MDA can efficiently capture the inter-dependence between asset returns and their volatility (hence their forecast-ability) when using intermediate return horizons, e.g., two months; accordingly, we set \(T=40\) trading days. The threshold value has been set to \(\varepsilon =0.5\), as advised in [23] for a similar scenario.
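
A direct translation of Eq. (2) and of the pruning rule into code could look as follows (the array handling is an assumption of this sketch):

```python
import numpy as np

def mda(forecasts, realized, T=40):
    """Mean directional accuracy (Eq. 2) over the last T days before day d.

    `forecasts` holds the ensemble outputs o_t^{s_i,ENS} and `realized` the
    returns y_t^{s_i}, both ordered by day and ending at day d-1.
    """
    f = np.asarray(forecasts)[-T:]
    y = np.asarray(realized)[-T:]
    return float(np.mean(np.sign(f) == np.sign(y)))

def keep_asset(forecasts, realized, T=40, eps=0.5):
    """Dynamic asset selection: keep the asset only if its MDA exceeds the threshold."""
    return mda(forecasts, realized, T) > eps
```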

9 Experimental Framework

We conducted the experiments on the S&P500 Index dataset, focusing on data from January 2003 to January 2016. We considered four years for training (which is why our tests begin in March 2007)Footnote 1 and approximately one year for trading (or testing). We compared our approach (the ensemble with dynamic asset selection, ENS-DS) against the ensemble without the dynamic asset selection (ENS), against each single regressor, and against the well-known Buy&Hold passive investment strategy, which is considered representative in the finance community [30].

The metrics we have used for comparison are: (i) return (cumulative, annual, and mean daily); (ii) Sharpe ratio; and (iii) maximum drawdown. Return measures the amount that an investment has gained or lost over the indicated period of time. The Sharpe ratio (SR) measures the reward-to-risk ratio of a portfolio strategy, and is defined as the excess return per unit of risk, measured in standard deviations. The maximum drawdown (MaxDD) is the maximum reduction in wealth that a cumulative return has produced from its maximum value over time. The results are summarized in Table 2. According to the cumulative return development over time in Table 2, the ENS strategy outperforms all the non-ensemble models. Its mean daily return is almost ten times that of the Buy&Hold strategy and up to three times the return of some individual regressors (e.g., RF). Moreover, compared to the simple average ensemble, the ENS-DS approach (with \(T=40\)) yields a performance increase of 5 percentage points.

Table 2. Results of the StatArb strategy over the period from March 2007 to January 2016

Besides the return, in terms of risk exposure the MaxDD offers an outlook on how severe an investment loss can be (lower is better). Also for this metric we notice the better performance of ENS-DS compared to the Buy&Hold strategy and every other baseline: the ENS-DS strategy produces a MaxDD of 11.5%, less than one fourth of that of the Buy-and-Hold strategy (45%). Finally, it can be noticed that the SR improves from 1.76 for the simple ensemble to 2.01 for the proposed ENS-DS, beating all the other baselines.
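
For reference, the two risk metrics can be computed from a series of daily portfolio returns as in the sketch below; the annualization factor and risk-free convention are standard assumptions, not taken from the paper:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return per unit of standard deviation."""
    r = np.asarray(daily_returns) - risk_free
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))

def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the cumulative return curve (MaxDD)."""
    wealth = np.cumprod(1.0 + np.asarray(daily_returns))
    peaks = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / peaks))
```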

10 Conclusions and Future Work

In order to provide insights into efficient stock trading, in this paper we proposed a general approach for risk-controlled trading based on machine learning and statistical arbitrage. The forecast is performed by an ensemble of regression algorithms, combined with a dynamic asset selection strategy that prunes assets whose performance decreased over the past period. As the proposed approach and all of its components are general, we created an instance of it focused on the S&P500 Index, using statistical arbitrage as the trading strategy. In this instance, we forecast intra-day returns using an ensemble of Light Gradient Boosting, Random Forests, Support Vector Regression, and ARIMA, and we proposed a set of heterogeneous features that can be used to train the models. By performing a walk-forward procedure, for each company and walk we tested all the combinations of features and internal parameters of each regressor in order to select the best model for each of them. The ensemble decision is then obtained, for each walk and company, by averaging the forecasts of the individual regressors. Our experiments showed that our ensemble strategy with dynamic asset selection reaches significant returns of 0.119% per day, or 36.6% per year. As future work, we are already working on applying our approach to other markets and comparing it against different baselines. Further directions include enriching the current approach with new types of assets, exogenous variables, and the employment of deep neural networks.