1 Introduction

Accurate stock market multimedia (chart) prediction is considered impossible by the old school of thought. The Efficient Market Hypothesis states that stock prices reflect all currently available information, so new information leads to unpredictable price movements. Random walk theory similarly concludes that stock prices cannot be accurately predicted from historical values [17].

However, the attraction of good returns has led to a myriad of methods for price prediction, and abundant research has been done in this area over the last three decades. Researchers nevertheless still regard prediction on non-linear, non-stationary financial time series as one of the most challenging tasks. Several mathematical models have been developed, but the results remain unsatisfying [21]. Studies focusing on forecasting the stock markets have mostly been preoccupied with forecasting volatilities [6].

Professional traders use fundamental and technical analysis for price prediction, and there is a plethora of literature suggesting different methods for predicting stock prices and indices. The fundamental approach is the traditional one and relies on company parameters [19]. Technical analysis is based on Dow Theory [19] and uses price history for prediction; it ranges from traditional statistical modelling to methods based on artificial intelligence and machine learning [28]. In the literature, many ANN models have been evaluated against statistical models for stock prediction. ANN has also been compared with different data mining classification algorithms [9, 28], and the comparisons suggest that ANN models give better results [6]. The literature suggests that the first neural network for stock market prediction was given by White [29]. Classical ANNs were mostly used for stock prediction in the later part of the last century, and about ten years ago researchers focused on applying the Multi-Layer Perceptron (MLP) to stock prediction [20, 24]. Of late, different variants of ANN in hybrid models have been applied to stock market prediction. Atsalakis et al. [1] surveyed ANNs applied to stock prediction but did not point out the feature extraction methods used. A Genetic Algorithm has been used to optimise an RNN for stock forecasting [13], and the Artificial Fish Swarm Algorithm (AFSA) has been used to optimise an RBFNN for stock prediction [23]. An extreme learning machine has also been used to construct a decision support system for stock price prediction and trading strategies [25]. Moreover, different feature extraction methods have been combined with ANNs: Curvilinear Component Analysis together with an RBFNN was applied to the Bel20 stock market index [16], a fusion technique was used to model stock data in the Indian context [22], and very recently the combination of feature extraction using 2-Directional 2-Dimensional Principal Component Analysis ((2D)2PCA) with an RBFNN has been applied to stock price prediction [4].

Some studies, however, have shown that ANN has a few drawbacks and is not well suited to stock prediction, because stock market data contains enormous noise and complex dimensionality, and ANN exhibits inconsistent and unpredictable behaviour on such data [23]. Most neural networks have a shallow architecture and are thus designed with one hidden layer, one reason being the lack of a successful training strategy for multi-layer networks. Another problem is that many neural networks tend to fall into a local optimum and thus over-fit [14]. Deep architectures can overcome these problems [14] and have already yielded promising performance in many fields, including language [30], speech [18] and image processing [8, 10, 34]. Hinton first proposed the idea of deep learning and harnessed the power of models beyond three-level nets [7]. According to some recent papers, DNNs can give better approximations to non-linear functions than shallow models [15, 26]. DNNs have been applied to some time series forecasting problems and have shown good results [11, 12].

Recent literature suggests that researchers are attempting to use deep learning for stock prediction. Its successful application in the speech domain [18] has led to the idea that, since speech is time series data and stock data is also a time series, the same methods can be used. However, DNN techniques have not yet been fully harnessed: one paper uses autoencoders to extract features from the input variables for stock trading strategies [27].

In this paper DNN is introduced as a classifier for stock multimedia (chart) trend prediction, and when compared with state-of-the-art techniques [23] applied to stocks it performs better. Section 2 describes the research methodology suggested for stock prediction. Section 3 provides the framework for the proposed model, depicting its working through a block diagram and describing the different features of deep learning used in the model. Section 4 implements the framework and describes the experimental setup. Section 5 presents the results and analysis, showing that DNN performs better than some of the latest techniques applied in this domain. Finally, Section 6 carries the conclusion and future directions.

2 Research methodology

The proposed methodology uses (2D)2PCA for dimensionality reduction [33] and thereafter uses DNN as a predictor. For (2D)2PCA, the N samples in I consist of \( \left\{ I_{11}, I_{12}, \ldots, I_{1n}, \ldots, I_{m1}, I_{m2}, \ldots, I_{mn} \right\} \), where N = m*n, and the covariance can be defined as

$$ Cov(I)=\frac{1}{N}{\displaystyle \sum_{i=1}^m{\displaystyle \sum_{j=1}^n\left({I}_{ij}-\overline{I}\right)*{\left({I}_{ij}-\overline{I}\right)}^T}} $$
(1)

where \( \overline{I}=\frac{1}{N}{\displaystyle \sum_{i=1}^m{\displaystyle \sum_{j=1}^n{I}_{ij}}} \) is the mean of all samples. Singular value decomposition (SVD) is used to compute the projection subspace \( V_d \) corresponding to the d largest eigenvalues. The feature matrix Y for (2D)PCA is obtained by (2):

$$ {Y}_i = {I}^T*\ {V}_d $$
(2)

Next, \( Y^T \) is used as the new training sample in place of I and the process is repeated to compute the output of (2D)2PCA as \( Z_i \). The output of (2D)2PCA is fed to the DNN.
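For concreteness, a minimal sketch of the reduction described by (1) and (2) is given below, using the eigendecomposition of the two directional covariance matrices (equivalent to the SVD step mentioned above). The function and argument names (twoD2pca, d, p) are illustrative and not from the original implementation.

```r
# Illustrative sketch of (2D)^2 PCA, assuming X is a list of training
# sample matrices (one window of technical indicators per sample) and
# d, p are the target column/row dimensions.
twoD2pca <- function(X, d, p) {
  Xbar <- Reduce(`+`, X) / length(X)                     # mean sample, cf. eq. (1)
  # column-direction covariance and its d leading eigenvectors (right projection)
  covR <- Reduce(`+`, lapply(X, function(A) t(A - Xbar) %*% (A - Xbar)))
  V <- eigen(covR, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  # row-direction covariance and its p leading eigenvectors (left projection)
  covL <- Reduce(`+`, lapply(X, function(A) (A - Xbar) %*% t(A - Xbar)))
  U <- eigen(covL, symmetric = TRUE)$vectors[, 1:p, drop = FALSE]
  # project each sample from both sides, i.e. eq. (2) applied in both directions
  Z <- lapply(X, function(A) t(U) %*% A %*% V)           # each Z_i is p x d
  list(Z = Z, U = U, V = V)
}
```

With d = p = 10, each window matrix is reduced to a 10 × 10 feature matrix, matching the smallest reduced size used later in the experiments.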

DNN is a multi-layer feed-forward neural network trained with supervised learning, as shown in Fig. 1. Here \( X_i \) are the nodes in the input layer; \( Y_j \) represent the neurons in the 1st hidden layer, which uses the hyperbolic tangent function for computation; and \( Z_k \) represent the neurons in the 2nd hidden layer, which again uses the hyperbolic tangent function. Finally, the output layer has two nodes \( P_l \), which use the softmax function for classification and a linear function for regression. The hyperbolic tangent is the activation function of the network. \( U_{ij} \) are the weights connecting the input and 1st hidden layer and \( b_j \) are the biases of the 1st hidden layer; \( V_{jk} \) are the weights connecting the 1st and 2nd hidden layers and \( c_k \) are the biases of the 2nd hidden layer; and \( W_{kl} \) are the weights connecting the 2nd hidden layer and the output layer, with \( d_l \) the biases of the output layer.

Fig. 1
figure 1

Forward Propagation of a 4 layered DNN

$$ {Y}_j=f\left({X}_i,{U}_{ij},{b}_j\right)= tanh\left\{\left({\displaystyle \sum_{i=1}^3{X}_i*{U}_{ij}}\right)+{b}_j\right\}=\frac{e^{\left\{\left({\displaystyle \sum_{i=1}^3{X}_i*{U}_{ij}}\right)+{b}_j\right\}}-{e}^{-\left\{\left({\displaystyle \sum_{i=1}^3{X}_i*{U}_{ij}}\right)+{b}_j\right\}}}{e^{\left\{\left({\displaystyle \sum_{i=1}^3{X}_i*{U}_{ij}}\right)+{b}_j\right\}}+{e}^{-\left\{\left({\displaystyle \sum_{i=1}^3{X}_i*{U}_{ij}}\right)+{b}_j\right\}}} $$
(3)
$$ {Z}_k={f}_1\left({Y}_j,{V}_{jk},{c}_k\right)= tanh\left\{\left({\displaystyle \sum_{j=1}^4{Y}_j*{V}_{jk}}\right)+{c}_k\right\}=\frac{e^{\left\{\left({\displaystyle \sum_{j=1}^4{Y}_j*{V}_{jk}}\right)+{c}_k\right\}}-{e}^{-\left\{\left({\displaystyle \sum_{j=1}^4{Y}_j*{V}_{jk}}\right)+{c}_k\right\}}}{e^{\left\{\left({\displaystyle \sum_{j=1}^4{Y}_j*{V}_{jk}}\right)+{c}_k\right\}}+{e}^{-\left\{\left({\displaystyle \sum_{j=1}^4{Y}_j*{V}_{jk}}\right)+{c}_k\right\}}} $$
(4)
$$ {P}_l={f}_2\left({Z}_k,{W}_{kl},{d}_l\right)= softmax\left\{\left({\displaystyle \sum_{k=1}^4{Z}_k*{W}_{kl}}\right)+{d}_l\right\}=\frac{e^{\left\{\left({\displaystyle \sum_{k=1}^4{Z}_k*{W}_{kl}}\right)+{d}_l\right\}}}{{\displaystyle \sum_{{l}^{\prime }=1}^2{e}^{\left\{\left({\displaystyle \sum_{k=1}^4{Z}_k*{W}_{k{l}^{\prime }}}\right)+{d}_{{l}^{\prime }}\right\}}}} $$
(5)

Learning occurs when these weights are adapted to minimize the error on labelled training data. The loss (error) function, which is the objective function, is minimized depending on whether the model terminates in linear regression or classification. W is the collection \( \{W_i\}_{1:N-1} \), where \( W_i \) denotes the weight matrix connecting layers i and i + 1 for a network of N layers; B is the collection \( \{b_i\}_{1:N-1} \), where \( b_i \) denotes the column vector of biases for layer i + 1.
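A compact sketch of the forward pass in (3)-(5) is given below for the classification variant with a softmax output (the regression variant simply drops the softmax). The layer sizes in the usage lines mirror Fig. 1 and the equations above and are illustrative only.

```r
# Forward propagation for the 4-layer DNN of Fig. 1: two tanh hidden
# layers (eqs. 3-4) followed by a softmax output layer (eq. 5).
# X: samples x input features; U, V, W: weight matrices; b, c, d: bias vectors.
forward <- function(X, U, b, V, c, W, d) {
  Y <- tanh(sweep(X %*% U, 2, b, "+"))   # eq. (3)
  Z <- tanh(sweep(Y %*% V, 2, c, "+"))   # eq. (4)
  S <- sweep(Z %*% W, 2, d, "+")
  P <- exp(S) / rowSums(exp(S))          # eq. (5): softmax over the output nodes
  list(Y = Y, Z = Z, P = P)
}

# Example with 3 inputs, two hidden layers of 4 units and 2 output nodes
set.seed(1)
X   <- matrix(rnorm(6), 2, 3)
out <- forward(X, matrix(rnorm(12), 3, 4), rep(0, 4),
                  matrix(rnorm(16), 4, 4), rep(0, 4),
                  matrix(rnorm(8),  4, 2), rep(0, 2))
rowSums(out$P)   # each row of P sums to 1
```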

The model given in Fig. 1 is a regression problem. For regression the loss function is given below:

$$ Mean\kern0.5em Squared\kern0.5em Error=L\left(W,\kern0.5em \left.B\right|j\right)=\frac{1}{2}{\displaystyle \sum_{j=1}^n{\left({y}_j-{\widehat{y}}_j\right)}^2} $$
(6)

Here, \( y_j \) is the actual output and \( \widehat{y}_j \) is the predicted output, where j indexes the n training examples. The loss function for classification is given below:

$$ Cross\kern0.5em Entropy=L\left(W,\left.B\right|j\right)=-{\displaystyle \sum_{j=1}^n\left[ ln\left({\widehat{y}}_j\right)*{y}_j+ ln\left(1-{\widehat{y}}_j\right)*\left(1-{y}_j\right)\right]} $$
(7)
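For reference, (6) and (7) translate directly into code; y and yhat are assumed to be vectors of actual and predicted outputs over the training examples.

```r
# Direct transcriptions of the two objective functions.
mse_loss <- function(y, yhat) 0.5 * sum((y - yhat)^2)               # eq. (6)

cross_entropy_loss <- function(y, yhat)                             # eq. (7)
  -sum(y * log(yhat) + (1 - y) * log(1 - yhat))
```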

In order to update the weights and biases of the network, the supervised training algorithm Stochastic Gradient Descent (SGD) is used. The following process is iterated until the convergence criteria are met: first, W and B are initialized, and then they are updated according to the following equations.

$$ {w}_{jm}={w}_{jm}-\alpha *\frac{\partial L\left(W\left.,B\right|j\right)}{\partial {w}_{jm}} $$
(8)
$$ {b}_{jm}={b}_{jm}-\alpha *\frac{\partial L\left(W\left.,B\right|j\right)}{\partial {b}_{jm}} $$
(9)

Here, α is the learning rate and \( w_{jm} \) is the weight for the m-th neuron connecting layers j and j + 1; similarly, \( b_{jm} \) is the bias for the m-th neuron connecting layers j and j + 1. The gradient \( \frac{\partial L\left(W\left.,B\right|j\right)}{\partial {w}_{jm}} \) is computed using backward propagation. The chain rule is used to compute this derivative; for the last (output) layer the computation is shown below:

$$ \frac{\partial L\left(W\left.,B\right|j\right)}{\partial {w}_{jm}}=\frac{\partial L\left(W\left.,B\right|j\right)}{\partial {f}_2\left({Z}_k,{w}_{kl},{d}_l\right)}*\frac{\partial {f}_2\left({Z}_k,{w}_{kl},{d}_l\right)}{\partial \left({\displaystyle \sum_{k=1}^4{Z}_k*{w}_{kl}+{d}_l}\right)}*\frac{\partial \left({\displaystyle \sum_{k=1}^4{Z}_k*{w}_{kl}+{d}_l}\right)}{\partial {w}_{jm}} $$
(10)
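The sketch below illustrates one SGD step, (8)-(10), restricted to the output layer of the regression variant (linear output with the MSE loss of (6)); the gradients for the hidden layers follow from the same chain rule but are omitted for brevity. The name sgd_output_step is illustrative.

```r
# One SGD update of the output-layer weights W and bias d, assuming a
# linear output and the MSE loss of eq. (6). Z holds the activations of
# the 2nd hidden layer (samples x units), y the target values.
sgd_output_step <- function(Z, y, W, d, alpha = 0.01) {
  p     <- as.vector(Z %*% W + d)       # network output
  err   <- p - y                        # dL/dp for L = 0.5 * sum((y - p)^2)
  gradW <- t(Z) %*% err                 # chain rule of eq. (10)
  gradd <- sum(err)
  list(W = W - alpha * gradW,           # eq. (8)
       d = d - alpha * gradd)           # eq. (9)
}
```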

3 Proposed method

In this paper, we have two purposes: one is to introduce DNN as the proposed model for stock prediction, and the other is to demonstrate that this method gives improved results compared to a state-of-the-art method. Figure 2 represents the proposed model.

Fig. 2
figure 2

Framework of the proposed model

3.1 Data collection

The data is collected from NASDAQ and the prediction is done for an individual stock, because index data does not consider firm characteristics and company-wise prediction is more useful for investors [20]. Therefore the data is collected for the stock multimedia (chart) of Google, an American multinational technology company specializing in Internet-related services and products. Our goal is to consider a time period long enough to capture a high diversity of price movements and also to avoid data snooping. The data set used for the experiment runs from August 19, 2004 to December 10, 2015; hence, the model is built for 2843 trading days. The data set is further divided into a training set and a testing set: the training set consists of data from August 19, 2004 to May 31, 2011 and the testing set from June 1, 2011 to December 10, 2015.

In this problem, each record of the data set includes daily information consisting of the closing price, the highest price, the lowest price, and the opening price, denoted at day t as \( x(t) \), \( x_h(t) \), \( x_l(t) \) and \( x_o(t) \) respectively. Other technical analysis parameters used as input include leading, lagging and trend-change indicators, to obtain a composite result.

This paper uses the 36 input variables employed in the literature [23]. The variables \( I_1 \) to \( I_{36} \) are computed according to the equations in Table 1; however, two parameters have been replaced with Bollinger Bands, which compare volatility and relative price levels [28]. When used as input to the model, these variables provide the forecast of the closing price on the next day. The forecast is short-term because data far from the forecasting date provides less and less information useful for forecasting [16].

Table 1 Input variables for the stock market data set
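The exact formulas of Table 1 are not reproduced here; as an illustration of the kind of indicator involved, the sketch below computes the Bollinger Bands mentioned above under their common definition (an n-day moving average of the closing price plus/minus k rolling standard deviations), which may differ in detail from the variant used in the paper.

```r
# Common-definition Bollinger Bands on the closing-price series x(t).
bollinger <- function(close, n = 20, k = 2) {
  roll <- function(f) sapply(seq_along(close), function(t)
    if (t < n) NA else f(close[(t - n + 1):t]))
  mid <- roll(mean)                     # n-day simple moving average
  sdv <- roll(sd)                       # n-day rolling standard deviation
  data.frame(lower = mid - k * sdv, middle = mid, upper = mid + k * sdv)
}
```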

3.2 Dimension reduction

(2D)2PCA is used to reduce the dimensions of the data set [33], as it projects the original raw data matrix into a projection matrix. There is some loss of information due to this method, but the processing time improves and the convergence speed of the model increases manyfold; on a large data set such loss of information does not cause much variation in the output. The output of (2D)2PCA is fed to the DNN input nodes as shown in Fig. 2.

3.3 Forecasting

The forecasting is done in two phases: in the first phase, training is performed to compute the weights W and biases B of the model; in the second phase, testing is done, where W and B are used to compute the output. Before this, the output of the (2D)2PCA is normalized [23] according to (11) to bring it into the range [0, 1].

$$ {Z}_{ij}=\frac{Z_{ij}- min\left({Z}_i\right)}{max\left({Z}_i\right)- min\left({Z}_i\right)} $$
(11)

Here in (11), \( Z_i \) denotes the output of (2D)2PCA and \( Z_{ij} \) is the normalized output, which is used as the input to the DNN.
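A one-line sketch of (11), applied here to each feature vector (taken row-wise) of the (2D)2PCA output:

```r
# Min-max normalization of eq. (11): each row of Z is rescaled to [0, 1].
normalize01 <- function(Z) {
  t(apply(Z, 1, function(z) (z - min(z)) / (max(z) - min(z))))
}
```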

3.4 Regularization

DNN is a complicated network that uses a large number of parameters, and as the complexity of the model increases the bias decreases but the variance increases. In order to balance this bias-variance trade-off, regularization is used; it makes the model simpler. Further, it reduces the variance by constraining the weights and biases and driving a few of them to 0, thus reducing the generalization error (the error rate observed on validation data) and avoiding over-fitting of the model [3, 5, 31].

The model uses the ℓ1 and ℓ2 norms for regularization, which modify the loss function as shown below:

$$ L\hbox{'}\left(W,\left.B\right|j\right)=L\left(W,\left.B\right|j\right)+{\lambda}_1{R}_1\left(W,\left.B\right|j\right)+{\lambda}_2{R}_2\left(W,\left.B\right|j\right) $$
(12)

For ℓ1 regularization, \( R_1(W,B|j) \) is the sum of the absolute values of all weights and biases; for ℓ2 regularization, \( R_2(W,B|j) \) is the sum of the squares of all weights and biases. The constants \( \lambda_1 \) and \( \lambda_2 \) are chosen to be very small, for example \( 10^{-5} \).
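In code, (12) simply adds the two penalties to the base loss; params is assumed to be a list of all weight matrices and bias vectors of the network.

```r
# Elastic-net style penalty of eq. (12) added to the unregularized loss.
regularized_loss <- function(base_loss, params,
                             lambda1 = 1e-5, lambda2 = 1e-5) {
  theta <- unlist(params)                          # all weights and biases
  base_loss + lambda1 * sum(abs(theta)) +          # R1: l1 term
              lambda2 * sum(theta^2)               # R2: l2 term
}
```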

3.5 Advanced optimization

The adaptive learning rate algorithm ADADELTA [32] automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence; it predicts stock prices very fast and gives more accurate results. Learning rate annealing is a heuristic approach; its drawback is that training tends to slow down at local minima instead of moving fast whenever suitable, and moreover the same learning rate is applied to all dimensions of the parameters [32].

Momentum is a per-dimension training method and an improvement over SGD. The gradients along the valley towards the minimum are much smaller, but since they point in the same direction they keep accumulating, hence speeding up the training.

$$ {w}_{t+1}={w}_t+\varDelta {w}_t $$
(13)
$$ {b}_{t+1}={b}_t+\varDelta {b}_t $$
(14)
$$ \varDelta {w}_t=\rho \varDelta {w}_{t-1}-\eta \frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t} $$
(15)
$$ \varDelta {b}_t=\rho \varDelta {b}_{t-1}-\eta \frac{\partial L\left(W,\left.B\right|t\right)}{\partial {b}_t} $$
(16)

Here, ρ is a constant which controls the decay of the previous parameter updates and η is a global learning rate shared by all dimensions. ADAGRAD uses an update rule for the bias b and the weight w which is given in (17)

$$ \varDelta {w}_t=-\frac{\eta }{\sqrt{{\displaystyle \sum_{i=1}^t{\left(\frac{\partial L\left(W,\left.B\right|i\right)}{\partial {w}_i}\right)}^2}}}*\left(\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t}\right) $$
(17)

The ADAGRAD method is sensitive to the choice of the learning rate η, and since the denominator is a continual accumulation of squared gradients the effective learning rate will continue to decay. As suggested by Zeiler [32], ADADELTA is used to overcome these limitations. ADADELTA accumulates the gradient over a certain window, and the denominator uses a local estimate of recent gradients, where at time t the running average is given by (18)

$$ E\left[{\left(\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t}\right)}^2\right]=\rho E\left[{\left(\frac{\partial L\left(W,\left.B\right|t-1\right)}{\partial {w}_{t-1}}\right)}^2\right]+\left(1-\rho \right)*{\left(\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t}\right)}^2 $$
(18)
$$ \varDelta {w}_t=-\frac{\eta }{\sqrt{E\left[{\left(\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t}\right)}^2\right]}+\varepsilon }*\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t} $$
(19)

In order to match the units of the numerator and the denominator, an additional term is introduced in (19), which becomes

$$ \varDelta {w}_t=-\frac{\sqrt{E\left[\varDelta {w_{t-1}}^2\right]}+\varepsilon }{\sqrt{E\left[{\left(\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t}\right)}^2\right]}+\varepsilon }*\frac{\partial L\left(W,\left.B\right|t\right)}{\partial {w}_t} $$
(20)
$$ E\left[\varDelta {w_t}^2\right]=\rho E\left[\varDelta {w_{t-1}}^2\right]+\left(1-\rho \right)*\left(\varDelta {w_t}^2\right) $$
(21)

ADADELTA is thus an improvement over these two methods as it avoids the manual selection of the learning rate hyperparameter. Since it is difficult to estimate suitable learning rates for a DNN with a deep architecture, ADADELTA gives better results for DNN training.
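A minimal sketch of the per-parameter ADADELTA update, (18)-(21), is given below; Eg2 and Edx2 are the running averages of squared gradients and squared updates, both assumed to be initialised to zero matrices of the same shape as W.

```r
# One ADADELTA step for a parameter matrix W given its current gradient.
adadelta_step <- function(W, grad, Eg2, Edx2, rho = 0.99, eps = 1e-8) {
  Eg2  <- rho * Eg2 + (1 - rho) * grad^2                  # eq. (18)
  dW   <- -(sqrt(Edx2) + eps) / (sqrt(Eg2) + eps) * grad  # eq. (20)
  Edx2 <- rho * Edx2 + (1 - rho) * dW^2                   # eq. (21)
  list(W = W + dW, Eg2 = Eg2, Edx2 = Edx2)                # eq. (13)
}
```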

4 Experimental setup

The purpose of this experiment is to predict the stock closing price and to compare the performance of this model with other ANN models. The latest literature shows that (2D)2PCA along with RBFNN has performed the best among all the other dimensionality reduction techniques combined with ANN [23]. Therefore, for this experiment, the dimensionality has been reduced using (2D)2PCA for the RNN, RBFNN and DNN models, in order to bring uniformity across the models. The resultant dimensions of (2D)2PCA have been chosen to be 10 × 10, 15 × 15 and 19 × 35 for a window size of 20, and therefore the 36 × 20 matrix is reduced to the above-mentioned dimensions. The window size is the number of days taken into account for predicting the next day's data; for example, a window size of 20 means data is taken for 20 days and the result is predicted for the 21st day.

Once the dimensionality is reduced, the output is computed for each day based on the deep learning equations. Both the input data and output data of the training set are passed to the deep learning method. The regularization parameters ℓ1 and ℓ2 are set to \( 10^{-5} \); this is found to be the best value over the range \( 10^n \) (n = -5, -6, …, -10). The loss function is set to the mean squared error and the regression stopping criterion is set to 0. For ADADELTA, the best ρ is selected from the three possible values 0.9, 0.99 and 0.999, the best ε value is selected from the range \( 10^n \) (n = -4, -5, -6, …, -10), and the number of epochs is set to 1000.
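The paper does not name the R package used for the DNN, but the hyperparameters above (tanh activation, ℓ1/ℓ2 penalties, ADADELTA's ρ and ε, 1000 epochs) map directly onto, for example, the h2o deep learning interface. The sketch below shows such a configuration under that assumption; the data frame and column names are hypothetical, and the hidden-layer sizes are assumed.

```r
library(h2o)
h2o.init()

train_hex <- as.h2o(train_df)                       # hypothetical training data frame
model <- h2o.deeplearning(
  x = setdiff(names(train_df), "close_next"),       # normalized (2D)^2PCA features
  y = "close_next",                                 # next-day closing price
  training_frame = train_hex,
  activation = "Tanh",
  hidden = c(200, 200),                             # two tanh hidden layers (sizes assumed)
  loss = "Quadratic",                               # mean squared error
  epochs = 1000,
  l1 = 1e-5, l2 = 1e-5,                             # regularization (Section 3.4)
  adaptive_rate = TRUE, rho = 0.99, epsilon = 1e-8) # ADADELTA (Section 3.5)
```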

For the RBFNN model, the Mean Squared Error goal is set to 0, and SPREAD is selected through repeated experiments, according to performance considerations, from the range \( m \times 10^n \) (m = 1, 2, …, 9; n = 1, 2, …, 9); the best SPREAD is found to be \( 6 \times 10^3 \). The maximum number of neurons is set equal to the total number of dimensions, i.e. 100 for 10 × 10. For Elman's RNN [2] the learning rate is 0.1, the number of units in the hidden layer is 10, and the maximum number of learning iterations is 1000. The performance is measured using different error measures such as Root Mean Square Error (RMSE), Hit Rate (HR) and Total Return (TR), which are listed in Table 2.

Table 2 Formula for different error measures
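Table 2 is not reproduced here; for readers who want to recompute the comparison, the sketch below uses the standard definitions of RMSE and Hit Rate (the fraction of days on which the predicted direction of the price change matches the actual direction), which may differ slightly from the paper's exact formulas.

```r
# Standard-definition error measures over the test period.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

hit_rate <- function(actual, pred)
  mean(sign(diff(actual)) == sign(diff(pred)))
```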

The experiment is conducted on a PC with 4 GB RAM and a 2 GHz CPU, running R version 3.2.2. However, since the state-of-the-art RBFNN technique was implemented on the MATLAB 7.10 (R2010a) platform [23], the same platform is used for RBFNN.

5 Results and analysis

The experimental results for different window sizes and dimensions of the model are shown in Fig. 3 and Tables 3, 4, 5, and 6, and Table 7 compares the performance of the proposed DNN with RBFNN and RNN. The results are drawn for window sizes of 20, 40, 60, 80 and 100 to test which window size gives the better result. Along with this, (2D)2PCA is used to reduce the dimensions of the input matrix (36 × 20 for window size 20) to lowest-range, middle-range and last-range dimensions.

Fig. 3
figure 3

Actual and Predicted closing prices for different window sizes and dimensions

Table 3 Errors measured for lowest range dimensions
Table 4 Errors measured for middle range dimensions
Table 5 Errors measured for last range dimensions
Table 6 Errors measured for window size 20
Table 7 Errors measured for various neural networks and DNN

In Fig. 3 the x-axis denotes the normalized closing price and the y-axis denotes the number of days. For window size 20, both the lowest-range dimensions 10 × 10 and the middle range 15 × 15 give better performance, as the actual and predicted lines are quite close; however, for the last-range dimensions 19 × 35 the results are not so satisfactory. For window size 40 the lowest-range dimensions 10 × 10 give better performance, whereas for the middle-range dimensions 25 × 25 and the last range 39 × 35 the results are not so satisfactory. For all the other window sizes the lowest-range dimensions 10 × 10 provide the best result. Our assumption, based on the literature [16], that short-term data provides a better forecast is hence confirmed, and the lowest-range data performs better than the middle range.

The errors are measured for each of the window sizes and reduced dimensions according to the equations given in Table 2. It is found that amongst the lowest-dimension matrices, i.e. 10 × 10, the best performance is for window size 20, as shown in Table 3. Similarly, amongst the middle-range dimension matrices the best performance is for window size 20, as shown in Table 4, and amongst the last-range dimension matrices the best performance is again for window size 20, as shown in Table 5. These results reinforce our assumption from the literature [16] that short-term forecasting is more accurate for stock prediction.

Since window size 20 performs the best amongst all window sizes, the results for its reduced dimensions are compared in Table 6. It is found that the 10 × 10 dimension matrix provides the best result: the total return is 1.36 for 10 × 10 versus 0.41 for 19 × 35, an improvement of more than 200%. Additionally, the forecasting accuracy measured by Hit Rate is 0.68 for 10 × 10 and 0.65 for 19 × 35, which is 4.4% better, and for the remaining error measures the 10 × 10 dimension again gives a better result.

Since window size 20 and dimensions 10 × 10 provide the best result, this configuration is used as input to the DNN, RBFNN and RNN, as displayed in Fig. 4.

Fig. 4
figure 4

Comparison of actual stock price and forecasted values from DNN, RBFNN and RNN

It is found that RNN performs very poorly, whereas the RBFNN and DNN predicted values are very close to the actual values. For perspective, the results of the DNN model for window size 20 and dimensions 10 × 10 show that the model uses a 4-layer network with 100 input units, 200 hidden-layer units and 1 output unit. The model's mean squared error is 6.43e-05 and the training time is 3 min and 51 s.

The comparative error measures for these neural networks are shown in Table 7. It is found that RNN is not a good performer, but DNN and RBFNN are very close. On a closer look, the Hit Rate performance is better for DNN, as it is 4.8% more accurate than RBFNN and 15.6% better than RNN; therefore DNN can be a better predictor for stock market trend prediction. The correlation coefficient between the actual and predicted return is 0.76 for DNN, 0.63 for RBFNN and 0.43 for RNN, demonstrating that DNN is 17.1% more highly correlated than RBFNN and 43.4% better than RNN.

6 Conclusion

This is the first work using deep learning for stock data forecasting. In this paper it is demonstrated that (2D)2PCA + deep learning on the Google dataset can improve the accuracy of stock multimedia (chart) prediction compared with conventional neural network methods combined with (2D)2PCA. The model has also been tested for varying window sizes and dimensions to improve accuracy, and it is found that window size 20 with dimension 10 × 10 gives the best results; for higher dimensions and large window sizes the deep learning method gives limited performance.

Experimental results confirm that the proposed model provides a promising method for stock trend prediction, as its Hit Rate is 4.8% more accurate than RBFNN and 15.6% better than RNN; therefore DNN can be a better predictor of stock market trends. The correlation coefficient between the actual and predicted return for DNN is 17.1% higher than for RBFNN and 43.4% higher than for RNN. It is also found that the proposed model does not give better results for Total Return and RMSE when compared with RBFNN. However, in future these measures could be improved with other deep learning techniques such as Deep Belief Networks, regularization, autoencoders and advanced optimizations. Finally, it would be interesting to investigate the effectiveness of deep learning in portfolio management and trading strategies.