1 Introduction

Prediction of different events like flood, weather, earthquake, election, and stock are considered as the challenging tasks and have a major impact on human life. Stock prices prediction is a complicated and dynamic process due to rapid prices movement and market behavior. The prediction techniques in the stock market can play a crucial role in bringing more people and existing investors at one place (Zhang et al. 2013). The Stock market forecasting and analysis have gained the attention of researchers in the field of data mining to develop applications which could assist individuals as well as business organizations in decision making to invest money or not on a specific stock by taking into account the profit and risk.

The major objective of stock market prediction is to design software applications that can (i) analyze stock-related data from different financial resources, (ii) process that data using some prediction model, and (iii) present information to users. Design of such applications receive the considerable attention from financial experts, economists, businessmen, and others. Many people, business and organization consider a stock business as a profitable resource and wish to take part in the particular business. However, before purchasing and selling the stock, they have to estimate the profit and demand of particular stock using online financial data and trends. Therefore, the stock prediction is a challenging task due to different factors, such as retrieval and analysis of relevant stock data, fitting data into appropriate model and interpretation of results (Ladan et al. 2014).

The existing studies (Ladan et al. 2014; Kamley et al. 2013; Javaid 2010; Reddy 2010; Yuan and Luo 2014; Park et al. 2010; Ariyo et al. 2014; Devi et al. 2013) on stock-based prediction are based on the sentiment-based and secondary data acquired from social sites and financial sources. The sentiment-based approaches for stock market prediction mainly rely on unstructured data extracted from social media sites. The major limitation of such approaches is the low accuracy of prediction model due to noisy nature of input data. The other category of stock-oriented prediction techniques is based on analytical models using secondary data. The analytical approaches provide a more accurate prediction as compared to sentiment-based techniques due to well-structured nature of input data. However, such methods for the development of stock-based prediction systems rely on data that is usually sparse in nature. Moreover, the selection of predictor variables is also poor, which ultimately degrades the performance of prediction model. The problems associated with existing approaches can be overcome by proposing an effective prediction model with improved quality of input data and revised selection/inclusion of predictor variables.

The proposed method is based on multiple regression analysis of stock trend prediction, supported by a revised set of stock indicators. The key addition to the state of art methods (Ladan et al. 2014; Kamley et al. 2013; Javaid 2010) is in the way it preprocesses the input data with a revised set of stock indicators. Our system can take dataset as input, apply preprocessing steps, perform prediction, and gives a recommendation about the stock investment to user on the basis of prediction. The proper selection of stock indicators and applying preprocessing steps provides efficiency of the proposed system with respect to the state-of-the-art methods in terms more accurate prediction of the stock trend.

The main of the contributions of this work include the development of a multiple regression-based stock market trend prediction system using a revised set of predictors (Appendix A of Supplementary Material). Following is the synopsis of contributions presented in this work.

  • Applying different pre-processing steps for dimensionality reduction, such as data cleaning using fill in missing values and data reduction using volume compaction.

  • To develop a multiple regression-based prediction model for predicting stock trend using a revised set of predictors.

  • To provide a user-friendly decision support interface for individuals and companies to invest or not in the stock market.

  • Evaluating performance of the proposed system with respect to baseline methods.

To achieve aforementioned objectives, the intention is to contribute knowledge, beneficial for stock market investors by developing a stock trend prediction system based on multiple regression model to assist the individuals and companies interested to take a decision regarding making an investment in the stock market.

The rest of paper is structured as follows. Section 2 presents literature review. In Sect. 3, we describe the proposed method. Experiment design is presented in Sect. 4. The final section outlines the work with a discussion on how it can be expanded in future.

2 Related work

The literature review section deals with the discussion on relevant studies on stock market trend prediction. The prior works are classified into two major categories, namely (i) analytical approach, and (ii) sentiment-based approach. Figure 1 shows proposed classification scheme of the literature review.

Fig. 1
figure 1

Classification of stock prediction approaches

The first level shows that there are two major approaches, namely (i) analytical, and (ii) sentiment-based for stock market prediction. The level two shows related data sources for each approach. The last level depicts the relevant models used for both of the aforementioned approaches of stock prediction. The rest of the section is based on the aforementioned literature review classification scheme.

2.1 Analytical approaches

These approaches deal with the acquisition of secondary and historical stock data from different financial sources, such as Yahoo Finance, Google Finance, and Pakistan Stock Exchange. Analytical approaches are further classified into Regression driven and Autoregressive Integrated Moving Average (ARIMA) techniques.

The regression driven approaches are based on different regression models such as linear regression, multiple regression, logistic regression and genetic algorithm. The collected data is passed through a specialized pre-processing module to prepare it for further processing. Finally, pre-processed data is made an input to the prediction module.

In data mining and statistics, multiple regression model is a technique for estimating the relationship between a dependent and one or more independent variables (Zhang et al. 2013). In multiple regression, an attempt is made to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation. The model associates each value of the independent variable x with a value of the dependent variable y. Following are some related works based on regression paradigm.

Ladan et al. (2014) introduced a multiple regression model to establish the relationship between different macroeconomic variables, namely: inflation rate, exchange rate and interest rate and stock prices. They compiled the dataset from CBN Statistical Bulletin 2010 available at www.cenbank.org. It is reported that exchange rate is significant, whereas interest rate is not significant and there exists a high correlation between the macroeconomic variables, such as inflation rate, exchange rate, and stock market returns. Moreover, it is observed that 54.2% of the total variation in all Index can be justified by the microeconomic variable. However, they used only limited number of macroeconomic variables, which if increased, can produce better results.

Kamley et al. (2013) applied the multiple regression approach to predict the stock market price from the stock market data (yahoo finance) using three variables, namely: open, high and close. Different pre-processing steps are applied for the cleansing, integration, and transformation of data. They achieved a prediction accuracy of 89%. However, the major limitation of their work includes lack sufficient training of the model due to poor selection of stock indicators.

Javaid (2010) used multiple regression model to predict market shares of companies on different parameters, namely: KIBOR, dividend, earning per share, gross domestic product, and inflation. They used Karachi-Stock-Exchange (KSE) 30-Index dataset and achieved up to 62% prediction accuracy. However, more enhanced variables can be used to predict the stock trend.

Reddy (2010) developed a statistical analysis tool (SAS) system to forecast stock market data by collecting secondary data over different periods from the National Stock Exchange (Nifty). They identified the seasonal differences and reported that the data series is likely to more accurate to build the model such as auto regression (AR), moving average (MA). Finally, forecasting is made for the market index by presenting change in market indices in a graphics mode. The limitation of AR model includes the huge difference between predicted and real-time values, due to some external factors, which often leads to inaccurate prediction.

Yuan and Luo (2014) analyzed the factors: price, trading volume, and position which affect the price movements. The required data is acquired from the “wind information terminal”, including the index of opening price, closing price, highest and the lowest price, volume and total holdings. The decision tree model is applied to predict price movement based on the selected variables by acquiring a prediction accuracy of about 70%. Moving average price has more significant with respect to volume and interest. They reported that the volume and position change has a limited role in price forecasting.

Park et al. (2010) describes the procedure to analyze financial data using multiple regression model. They used MRDDV model that emerges quantitative as well as qualitative variables to estimate the behavior of data. S&P closing price and XHB daily closing price are used as quantitative variables and consumer sentiments and housing construction data are used as qualitative variables. The model indicates the highs of y, the model shows lows in the stock data in both cases. However the lac of their study to include only one quantitative and one qualitative variable instead of using several quantitative variables for more effective analysis.

Unlike regression models, Autoregressive Integrated Moving Average (ARIMA) techniques, introduced by Box and Jenkins in 1970, are considered as effective for short-term prediction, such as financial time series forecasting. Following are some the selected studies.

Ariyo et al. (2014) introduced a short-term stock prediction system using ARIMA model using stock-related data from New York Stock Exchange (NYSE) and Nigeria Stock Exchange (NSE). The model achieved results with respect to comparing methods.

Devi et al. (2013) developed a stock trend forecasting system using ARIMA model by highlighting seasonal trend and flow. The model produces efficient results in terms of effective time series analysis for stock trend prediction with respect to baseline studies.

In their work on Indian stock prediction, Mondal et al. (2014) performed experiments on 56 Indian stocks using ARIMA model. The proposed model is evaluated using Akaike information criterion (AIC). Experimental result show effectiveness of the model in terms of improved prediction accuracy over previous different periods of data.

2.2 Sentiment-based approaches

In these approaches, the stock prediction is performed on the basis of user-generated reviews about a particular stock, posted on different online forums and review sites. User reviews about a particular stock are extracted, pre-processed and classified using supervised and unsupervised learning approaches. In the following paragraphs, some of the prior works performed on sentiment-based approaches are presented.

Rao and Srivastava (2012), analyzed the relationship between tweets analogy such as bullishness, volume, agreement with the financial market indicators like volatility, trading, and stock prices, and achieved a stable correlation. On daily basis, the sum of positive and negative tweets are calculated using Naïve Bayesian classifier. For forecasting, they incorporated an Expert Model Mining System (EMMS) with R square and mean absolute percentage error. The prediction accuracy increases with the increase in time windows and vice versa. However, they conducted the experiments by capturing sentiments of a particular index or company instead of multi-company indices.

Qasem et al. (2015) used a machine learning techniques to classify twitter sentiments. Stock sentiments are collected mainly from Twitter, Google, Facebook, and Tesla. After pre-processing, the neural network and logistic regression model are trained for further analysis. Results show that uni-gram-based TF-IDF outperforms bi-gram-based TF IDF with an accuracy of 58%. The limitation of their work includes not using the clustering techniques.

Nasseri et al. (2015) proposed an intelligent trading support system using text mining techniques and reported that stock-related microblogs affect the stock prices. The stock-related sentiments are collected from “StockTwits” micro-blogging site. After performing feature selection for extracting relevant terms in a tweet, decision tree model is applied to identify the decision trend of important terms. Results show that user’s sentiments act as a valuable resource for stock trading decision. The limitation of their work includes the limited length of N-grams for extracting tweets.

Enke et al. (2011) proposed a three-layer stock market prediction system based on regression analysis. The first layer chooses only those variables which have a positive and significant relationship with the target. After selecting the appropriate parameters, next stage is to apply type-2 fuzzy clustering technique on these parameters for constructing a cluster of related data and to extract fuzzy rules from such clusters. Finally, the extracted fuzzy rules are passed through a neural network for effective prediction of the stock market behavior.

In the classification of Indian stock market data, Soni and Shrivastava (2010) applied three supervised machine learning algorithms: classification and regression tree (CART), LDA and QDA, which provides a comprehensive mode of analysis of stock market data in form of a binary tree, linear surface, and quadratic surface. The performance of the three machine learning algorithms is evaluated in terms of misclassification and correct classification rate. The misclassification rate of CART algorithm is 56.11%, which is smaller than the other two machine learning algorithms, indicating a better classification performance. Table 1 gives an overview of the selected studies conducted on stock prediction. The proposed system is shown in Fig. 2.

Table 1 Overview of selected studies
Fig. 2
figure 2

Proposed system

3 Methods

The proposed system implements a multiple regression technique using a revised set of stock predictors supported by computation and preprocessing steps for the prediction of a stock market trend in the form user-friendly recommendation interface to end users.

3.1 Data collection

We extracted three different historical local stock quotes from a financial website (ksestocks.com/QuotationsData), spanning over the duration of 2 years (01/01/2014 to 01/12/2015) and having the different attributes like stock closing, stock opening, high, low, close and volume. Each stock is represented by a stock symbol like KSE Stand for Karachi stock exchange, LUCK represents Lucky Cement Stock (Table 2). In this work, we used Karachi stock exchange as a data collection resource, as it holds a bulk of historical data relevant to local stock. Each data set consist of 24 months historical stock related instances. Table 3 shows the detail of KSE 100 index stock quotes, where the “Symbol” column represents a specific company or stock equity, such as KSE stand for Karachi stock exchange, “Date” column shows the time and day of stock transaction. The “Open”, “High” and “Low” columns indicate the starting value, peak value and lowest value of an equity for a specified date, and “close” represents the final worth of an equity at the time of closing stock market. Similarly “Volume” column shows the number of stock traded on a specific day.

Table 2 Datasets used
Table 3 Sample KSE 100-INDEX stock quotes

To acquire historical data from KSE stock/quotation, following steps are performed: (i) open browser and write the desired address in the address bar, (ii) select a specific script from a date to date and MS excel format option, (iii) click the prepare summary file button, and (iv) after selecting all of the appropriate options mentioned above, click on the start download image to successfully download a specific dataset in excel format.

3.2 Computation and pre-processing of stock market indicators

Proper selection, computation and preprocessing (Appendix E, F, G of Supplementary Material) of stock market indicators assist in developing the quality input data and improving the efficiency the stock prediction process. Computation and pre-processing of Stock Market Indicators aim at the preparation of acquired dataset for subsequent processing to achieve the results efficiently (Zhang et al. 2013). The basic objective is to derive relevant information from existing attributes to be used as an input parameter to train the multiple regression model. In this work, we propose to use three predictor variables, namely: Return, Volatility, ROI (Return on Investment) and Volume; and one predicted variable Return, to effectively predict future stock market trend regarding stock securities. The aforementioned variables have been selected due to their efficacy reported in the recent studies (Ladan et al. 2014; Kamley et al. 2013; Javaid 2010). Later, the effectiveness of the selected variables is verified by applying t test (see Sect. 4). For this purpose, following computations and preprocessing steps are applied (Algorithm 1) on input data to reduce data sparseness from the collected data to obtain stock indicators (predictor and predicted variables) with a minimum level of noise for subsequent processing (Simon and Raoot 2012). Let us take the brief description of the aforementioned indicators.

3.2.1 Computation of stock change

The Stock change computation is performed to identify stock return or gain. Stock change can be computed (Eq. 1) by subtracting the previous stock closing price from the current days close and dividing it by previous day close (Rao and Srivastava 2012).

$$Stock{\text{\_}}change = (current\;day\;stock\;closing\;price - previous\;say\;stock\;closing\;price)/(previous\;say\;stock\;closing\;price)$$
(1)

Stock change computation of 2 January 2014 is illustrated with the following example, using Eq. 1\(Stock{\text{\_}}change = (321.67 - 310.6)/310.6 = 0.0 3 5 6 4 1\). A sample list of entries for stock change computations is presented in Table 4.

Table 4 Stock change computation

3.2.2 Computation of stock gain/return

The stock gain or return is the profit obtained by the investors, after purchasing some stock security to do business in the stock market. It is the dependent variable, whose value is to be predicted on the basis of other predictors used in the multiple regression model. If the value of “change” (Table 5) is greater than zero, then its value is assigned to “Return” column using computations performed in MS-Excel software.

Table 5 Stock gain computation

3.2.3 Determination of dispersion/volatility

The stock volatility is an important indicator that acts as a predictor variable in the proposed regression model to predict stock return, showing how much variability exists in the stock return. Normally, Volatility measures the dispersion of returns and can be calculated by using the Standard Deviation or variance of the stock change of the same security multiplied by square root of 360 (360 days in a year). If the volatility is high (≥ 50%), it indicates that the stock is riskier (Rao and Srivastava 2012). A sample list of entries for stock volatility computations is presented in Table 6.

Table 6 Stock volatility computation

3.2.4 Stock volume

The stock volume is considered as a strong indicator to effectively forecast the stock market return, enabling the stock customers to make an investment or not (Shen et al. 2012). Therefore, “stock volume” as a predictor variable, plays a pivotal role in making the about decision stock investment. The stock-volume refers to the total number of shares of a specific stock traded on a stock exchange. In the stock exchange, the term “high volume” means bullish and “low volume” represent bearish. A sample list of entries for stock volume is presented in Table 7.

Table 7 Sample LUCK stock volume listing

3.2.5 Volume reduction by computation of average

As proposed by the Simon and Raoot (2012), volume reduction can improve the efficiency of the prediction model, therefore, we applied volume reduction by taking the monthly average of each independent and dependent variable (Eqs. 26).

In this study, we use a daily stock quotation to predict the future stock fluctuation based on different parameters. These daily quotations become a bulk of data with sparse nature of high and low values. To minimize such sparseness of data, we take the monthly average of each parameter. Due to such inconsistencies in data, prediction accuracy is decreased, as low-quality data will generate low quality mining results (Zhang et al. 2013). To increase the efficiency of the stock prediction model, data sparseness issue in terms of variant nature of data in volume, ROI, return and volatility, needs to be addressed (Nasseri et al. 2015). For this purpose, we propose to address the aforementioned issue by taking average of each parameter (volume, ROI, return, and volatility) as shown in Eqs. 26. It is observed experimentally (see Sect. 4) that the average values are less prone to sparseness and increase the accuracy of the result. The formula to calculate monthly average is given as follows:

$$monthly_{average} = sum{\text{\_}}of{\text{\_}}values/total{\text{\_}}value$$
(2)
$$monthly{\text{\_}}volatility_{average} = sum{\text{\_}}of{\text{\_}}values/total{\text{\_}}value$$
(3)
$$monthly{\text{\_}}volume_{average} = sum{\text{\_}}of{\text{\_}}values/total{\text{\_}}value$$
(4)
$$monthly{\text{\_}}ROI_{average} = sum{\text{\_}}of{\text{\_}}values/total{\text{\_}}value$$
(5)
$$monthly{\text{\_}}return_{average} = sum{\text{\_}}of{\text{\_}}values/total{\text{\_}}value$$
(6)

where the sum_of_values represents the summation of daily stock volume, stock return, ROI and stock volatility; and total_values shows the number of days in a month on which stock market transactions occur. For example, using Eq. 3 and its corresponding data values of the excel columns, we compute monthly-volatility average as follows:

$$monthly{\text{\_}}volatility_{average} = sum(H249 :H270)/22 = 0.270753017$$

A sample list of entries of average stock volatility, volume, return and ROI computations, taken from the KSE dataset are presented as follows (Table 8).

Table 8 Stock average volatility, volume, return and ROI computations

3.2.6 Return-on-investment (ROI)

The last parameter “Return-on-Investment (ROI)”, shows the profit gained by the investor with respect to traded stock (Kamley et al. 2013). It plays an important role in predicting the stock market trend, and can be computed by applying the following formula:

$$ROI = average{\text{\_}}monthly{\text{\_}}return/average{\text{\_}}monthly{\text{\_}}volume$$
(7)

where average_monthly_return shows the profit and average_monthly_volume represents the quantity of stock traded on a stock market. For example, if the average_monthly_return = 0.007393368, and the average_monthly_volume = 426,763.6364, then applying Eq. 7, we compute monthly ROI as follows: \(ROI = 0.007393368/426763.6364 = 1.73243{\text{E}} - 08\). A sample list of entries of ROI computations is presented in Table 9.

Table 9 Sample computations for LUCK stock return on investment (ROI)

3.2.7 Missing values

To handle data inconsistency for improving the efficiency of the prediction model, it is suggested that the missing values can be filled for data cleaning (inconsistency removal) (Feinberg and Genethliou 2005). There are three ways to fill in missing values: (i) by taking the average and put it to the missing place, (ii) to randomly select the entry and fill the missing value, and (iii) fill in the missing value with one step previous value. We have chosen the third option and therefore, the data cleaning is applied by manually filling the missing values with the relevant upper value. The missing values are mainly generated due to dividing by zero entry during the computation of dependent and independent variables at monthly basis.

figure a

3.3 Stock market analysis/predictions system

The stock market analysis/prediction system aims at analyzing the stock-related financial data by applying some prediction model and presenting information to users/investors to invest or not. An economy with high stock prices is considered as rising economy. The prediction system assists in decision making process regarding the purchase and selling of shares. Prediction aims at forecasting a future event (response variable), based on different parameters (independent variables).

In the stock market environment, stock return can be predicted by using the historical stock quotations: stock return, stock volatility, stock ROI, and stock volume values. The basic idea is to find a linear combination of stock volatility, ROI and stock volume that best predicts the stock return. It assist the investors in making financial decisions whether to invest or not in gaining maximum profit. Multiple Regression analysis can be used to infer causal relationships between the independent and dependent variables (Ladan et al. 2014). The formal notation of multiple regression model with n variables is presented as follows:

$$Y_{i} = \beta_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} \ldots \beta_{n} X_{n} + \mu$$
(8)

where Y is the dependent variable, \(X_{2}\) and \(X_{3}\) up to \(X_{n}\) are the independent variables, \(\mu\) is the residual term. In Eq. 8, \(\beta_{1}\): It is the intercept term which gives a mean or average effect on \(Y_{i}\) of all the variables: \(X_{2}\) and \(X_{3}\) up to \(X_{n}\). \(\beta_{2}\), \(\beta_{3}\) up to \(\beta_{n}\): these are called the regression coefficients or slope, \(\beta_{2}\) measures the change in the mean value of Y with respect to per unit change in \(X_{2}\), such that the values of \(X_{3}\) and \(X_{4}\) remain constant. Similarly \(\beta_{3}\) measures the change in the mean value of Y with respect to per unit change in \(X_{3}\), such that the values of \(X_{2}\) and \(X_{4}\) remain constant; and \(\beta_{4}\) measures the change in the mean value of Y with respect to per unit change in \(X_{4}\), such that values of \(X_{2}\) and \(X_{3}\) remain constant (Zhang et al. 2013).

As the proposed stock prediction model is based on four variables, therefore, the general regression model (Eq. 8) can be formulated as follows:

$$Y_{i} = \beta_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4} + \mu$$
(9)

where \(\beta_{1}\) is the intercept, \(\beta_{2}\), \(\beta_{3} ,\) and \(\beta_{4}\) are regression coefficients, and can be estimated using the least squares procedure (Zhang et al. 2013; Simon and Raoot 2012) to minimizes the sum of squares of errors. Therefore, using Eq. 9, the value of \(\beta_{1}\) can be computed as follows:

$$\beta_{1} = {\bar{\varUpsilon }} - \beta_{2} \overline{{X_{2} }} - \beta_{3} \overline{{X_{3} }} - \beta_{4} \overline{{X_{4} }}$$
(10)

Similarly, using Eq. 9, values of regression coefficients or slope: \(\beta_{2}\), \(\beta_{3}\) and \(\beta_{4}\), can be derived as follows (Zhang et al. 2013):

$$\beta_{2} = \frac{{\left( {\varSigma y_{i} x_{2i} } \right)\left( {\varSigma x_{3i}^{2} } \right) - \left( {\varSigma y_{i} x_{3i} } \right)\left( {\varSigma x_{2i} x_{3i} } \right)}}{{\left( {\varSigma x_{2i}^{2} } \right)\left( {\varSigma x_{3i}^{2} } \right) - \left( {\varSigma x_{2i} x_{3i} } \right)^{2} }}$$
(11)
$$\beta_{3} = \frac{{\left( {\varSigma y_{i} x_{3i} } \right)\left( {\varSigma x_{2i}^{2} } \right) - \left( {\varSigma y_{i} x_{2i} } \right)\left( {\varSigma x_{2i} x_{3i} } \right)}}{{\left( {\varSigma x_{2i}^{2} } \right)\left( {\varSigma x_{3i}^{2} } \right) - \left( {\varSigma x_{2i} x_{3i} } \right)^{2} }}$$
(12)
$$\beta_{4} = \frac{{\left( {\varSigma y_{i} x_{4i} } \right)\left( {\varSigma x_{2i}^{2} } \right) - \left( {\varSigma y_{i} x_{2i} } \right)\left( {\varSigma x_{2i} x_{3i} } \right)}}{{\left( {\varSigma x_{2i}^{2} } \right)\left( {\varSigma x_{3i}^{2} } \right) - \left( {\varSigma x_{2i} x_{3i} } \right)^{2} }}$$
(13)

The proposed modsel uses: (i) three independent/predictors (volatility, volume and ROI), (ii) a dependent/predicted variable(return), (iii) \(\beta_{1}\)(intercept), and (iv) three regression coefficients/slopes: \(\beta_{2}\), \(\beta_{3}\) and \(\beta_{4}\). Using aforementioned setup, the Eq. 9 of our proposed regression model can be calibrated as follows:

$$return = \beta_{1} + \beta_{2} \times volatility + \beta_{3} \times volume + \beta_{4} \times ROI$$
(14)

Computing the regression coefficients: To compute the Regression coefficients, intercept line and other statistics, Im() function of R software (Nasseri et al. 2015) is applied on the KSE stock dataset, a partial listing is presented in Table 10. The implementation of Im() function in R is shown as follows:

$$fit < - lm(\text{Re} turn\sim Volatility + Volume + ROI, data = tdata)$$
(15)

where Im() is function; Return is predicted variable; Volume, volatility and ROI are the predictor variables; tdata is the training dataset; and fit the object which holds the result/summary (for detailed summary, see Sect. 4).

Table 10 Monthly return, monthly volatility, monthly volume and monthly ROI quotation data

Intercept calculation: As mentioned earlier, the fit object contains a detailed summary of results obtained from the proposed prediction model, in which intercept value (\(\beta_{1}\)) is computed (Eq. 10), as follows:

$$\beta_{1} = - 3.378e ^{ - 03}$$
(16)

similarly, other regression coefficients: \(\beta_{2} , \;\beta_{3} ,\;\beta_{4}\) are also computed by the Im() function of R (Eqs. 1113) as follows:

$$\beta_{2} = 8.439e^{ - 04}$$
(17)
$$\beta_{3} = 2.638e^{ - 11}$$
(18)
$$\beta_{4} = 1.177e^{ + 08}$$
(19)

Finally, putting the values of \(\beta_{1} , \;\beta_{2} , \;\beta_{3} , \;{\text{and }}\;\beta_{4}\). in Eq. 14, we get the trained regression model as follows:

$$return = - 3.378e^{ - 03} + 8.439e^{ - 04} \times volatility + 2.638e^{ - 11} \times volume + 1.177e^{ + 08} \times ROI$$
(20)

The Eq. 20 is a trained regression model to predict the stock return by using volatility, volume and ROI. To calculate return, we put the values of the slope (\(- 3.378e ^{ - 03}\)) and regression coefficients (\(8.439e^{ - 04}\),\(2.638e^{ - 11}\), and \(1.177e^{ + 08}\)) along with the values of test variables: volume, ROI, and volatility. In this way, the stock return of the next month can be predicted using Eq. 20 and applying predict() function of R software, as follows:

We can predict the next month stock return, using Eq. 20, by applying the R predict function as follows:

$$predict(fit,newdata)$$
(21)

where fit is an object holding the values of trained model computed using Eq. 15 and newdata is a test data parameter computed as: newdata = data.frame (volatility = 0.226194175, volume = 96,293,972.63, ROI = 3.99E−11), where volatility, volume and ROI are the test/predictor variables. After finding the values of fit and newdata, we get: predict(fit, newdata) = 0.0040481.

In Table 10, the monthly return (0.0038456) for Aug-14 (highlighted in the red color font) shows that the prediction made by our proposed model (predict(fit, newdata) = 0.0040481) is quite accurate.

3.4 Proposed algorithm and implementation

The complete algorithm of the proposed model is shown as follows:

figure b

The proposed stock prediction model is implemented using R software (Kabacoff 2015), a widely used platform for data analysis and visualization purposes. R contains a bundle of packages and libraries for statistical analysis, data extraction, and graphics. It is an open source software, and freely available from the Comprehensive R Archive Network (CRAN) at (http://cran.r-project.org). To perform experiments, we randomly split the dataset into training (80%) and testing (20%), this split ratio of training and testing data assists in achieving promising results.

The implementation procedure of experiments conducted on three datasets is presented as follows (step#1 to step#9): (1) Data collection: The required stock data is extracted using the first step of the algorithm 2, (2) Pre-processing: the data acquired at step#(1) is pre-processed by applying the second step of the algorithm 2, which calls the pre-processing algorithm (algorithm 1), (3) File saving: the preprocessed data obtained at step#(2) is saved as.csv(comma separated value) file, which is subsequently made input to the R software, (4) File reading: to read.csv file, we used “read.csv (file. Choose(), header = T)”, option of the R software, as shown in the Fig. 3, after applying the read.csv() function, contents of the input dataset are automatically stored in the data1 object to be used in subsequent analysis, (5) Content display: To display contents of the data1 object (read file), we write data1 and press enter; Resultantly, one can visualize contents of all of the variables (predictors and predicted), as shown in Fig. 4, (6) Model training: In this step, multiple regression model is trained by applying Imo() function on the input data (tdata object) along with required parameters, the model’s details are stored in the fit object, as shown in Fig. 5, (7) Model statistics: After training the multiple regression model (step# 6), the next task is to show the detailed summary of the model using summary() function. Figure 6 shows different parameters of the implemented model, (8) Regression model generation: Using the summary details generated in step# 7, we can build the model regression (using Eq. 14), by incorporating different parameters like slope/intercept, regression coefficients as follows:

$$return = - 3.527e^{ - 03} + 1.838e^{ - 03} \times volatility + 2.61e^{ - 11} \times volume + 1.200e^{ + 08} \times ROI$$
(22)

Equation 22 is the proposed trained model regression equation and is able to predict stock return based on the values of volatility, volume, ROI (return on investment) and others parameters like slope and regression coefficients. Further details of the proposed model are discussed in Sect. 4; (9) Prediction: predict() function is applied by using fit and vdata (validation data) as function parameters to test the model performance(Fig. 7); Call Recommendation() function: This Recommendation() function, contains different parameters, namely: fit, volatility, volume and ROI, and Return[], which are passed to the predict() function to return a Predicted_Val, if the Predicted_Val is greater than the actual return values, increment the counter variable by one; finally, if the counter value is greater or less than the threshold-value (70% of the total instances) then print a recommendation message, i.e. “invest on stock” or “not invest on stock” (see Figs. 8, 9).

Fig. 3
figure 3

Read dataset using R software

Fig. 4
figure 4

Show the read data using R tool

Fig. 5
figure 5

Apply Im() R function to train the Regression model

Fig. 6
figure 6

Model statistics by applying R Summary(), function

Fig. 7
figure 7

Apply predict() function using R software

Fig. 8
figure 8

Recommendation using R software

Fig. 9
figure 9

Recommendation using R software

4 Results and discussion

The earlier studies (Ladan et al. 2014; Kamley et al. 2013; Javaid 2010; Ariyo et al. 2014; Devi et al. 2013) used open, high, low, inflation rate, exchange rate, interest rate, returns, volatility, trading volume and index close price. However, they achieved low accuracy, whereas, this work uses different stock-related attributes including stock closing, stock volatility, volume, ROI (return on investment) and stock return to increase the efficiency of the proposed work by improving the accuracy. The mentioned variables have some sort of significant relationship among them. To check the significance of each parameter, T-test is applied (Ahmadifard et al. 2013). Table 11 shows the relative significance of each parameter in all of the three models.

Table 11 Parameters relationship and significance

Table 11 gives experiment-wise detail of different predictors, their t-value, and associated p-value. The T-value column (T-statistics) shows the significance of the relationship between dependent and independent variables. In experiment no. 1, stock volatility reflects a positive but insignificant relationship with stock return; stock volume has a positive and significant relationship with the stock return, which means that increase in stock volume will an bring increase in stock return; and ROI also has a positive and significant relationship with stock return. Table 11 shows that experiment no. 2 also yields the same result as experiment no. 1. The experiment no. 3 shows that the stock volatility establishes a positive and significant relationship with the stock return. Similarly, stock volume has a positive and significant relationship and finally, the ROI has a positive but significant relationship with stock return. The proposed regression-based prediction model provides an efficient prediction of the stock trend.

The results reported in Table 11 gives following findings:

On KSE 100 Index, there is a significant positive relationship between the response variable “stock return” with predictor variables: (a) ROI and (b) stock volume. Therefore, the model has a strong tendency of predicting the stock trend efficiently.

On Lucky Cement Stock, there is a significant positive relationship between the response variable “stock return” with predictor variables: (a) ROI and (b) stock volume. Therefore, there the model has a strong tendency of predicting the stock trend efficiently.

On Engro Fertilizer Limited, there is a significant positive relationship between the response variable “stock return” with predictor variables: (a) stock volatility (b) Stock Volume and (c) ROI. Therefore, our model shows a strong tendency of predicting the monthly stock trend.

4.1 Coefficient of determination (multiple R-square)

Prediction accuracy shows that how efficient the model is with prediction ability with respect to forecasting the future stock market up and down. To determine the model accuracy, multiple r-square is used, which is also called the coefficient of determination. It shows that how much variation in stock return is explained by predictor variables (Nagar and Hahsler 2012). The Table 12 shows that, in experiment no. 1, we used KSE 100 index as an input dataset to the multiple regression model and obtained 95% prediction accuracy, which indicates that 95% variance in stock return is explained by predictors. Similarly, in experiment no. 2 Lucky Cement Stock dataset is used to train the model by receiving an 89% prediction accuracy. Finally, in experiment no. 3, Engro Fertilizer Limited stock data is used by achieving 97% prediction accuracy.

Table 12 Model accuracy on three datasets

Value of multiple R-squared in Table 12 is inspected. Using “KSE 100 Index” dataset, the value of multiple R-squared is 0.95, which depicts that the predictor can measure the variance in stock return up to 95%. While applying our model on “Lucky Cement” dataset, the value of multiple R-squared is 0.89, which depicts that the predictor can measure the variance in stock return up to 89%. Finally, the “Engro Fertilizer Limited” dataset yields a 0.97 value of multiple R-squared, which reflects that our model has the 97% prediction accuracy.

4.2 Evaluating the fitness of proposed model

To evaluate the fitness of proposed regression model, F-test is applied. The F-test evaluates the overall relationship between the target and the set of predictors (Simon and Raoot 2012). Results reported in Table 13 show the F-value along with probability values obtained from the three experiments. The “F-statistics” column of the experiment on “KSE 100 index” dataset reports a value of 103.4, whereas its corresponding “probability value/significance level” is 0.0001, showing that the target/response variable has a significant relationship with the set of predictors. Conclusively, our model is good-fit for the aforementioned dataset. The “F-statistics” column of the experiment on “Lucky Cement” dataset reports a value of 42.95, whereas its corresponding “probability value/significance level” is 0.001, showing that the target/response variable has a significant relationship with the set of predictors. Conclusively, our model is good-fit for the aforementioned dataset. The “F-statistics” column of the experiment on “Engro Fertilizer Limited” dataset reports a value of 191.7, whereas its corresponding “probability value/significance level” is 0.01, showing that the target/response variable has a significant relationship with the set of predictors. Conclusively, our model is good-fit for the aforementioned dataset.

Table 13 F-statistics and model fitness

4.3 Proposed model assist individuals and companies to invest or not in the stock market by providing user friendly interface

To assist the investors and companies in taking a decision regarding making investments in the stock market, we provide the user-friendly interface using R libraries and functions. Different textboxes, labels and button controls are created by using R, gWidgets_0.0-541, digest_0.6.102 and gWidgetstcltk_0.0-553 library. Users can enter data: volatility, volume and ROI in the textboxes, after pressing the button, suggestions are displayed, such as “invest money on stock” or “not invest money on stock”, by propagating the data into the fitted model (see Figs. 8 and 9).

4.4 Comparison of proposed model with state-of-the-art methods

Finally, we compare our proposed stock prediction model with the state-of-the-art methods (Ladan et al. 2014; Kamley et al. 2013; Javaid 2010).

In their work on stock prediction, Ladan et al. (Ladan et al. 2014), used multiple regression model to predict stock prices, by taking Inflation rate, exchange rate and interest rate as independent variables and all share index as a dependent variable. The used dataset obtained from CBN Statistical Bulletin (2010), by achieving 54% prediction accuracy, it means that 54% variability in all share index is explained by predictors.

Kamley et al. (2013) used multiple regression model to predict stock prices, by taking open, high and close as independent variables. They used “Infosys Company” dataset obtained from “Yahoo finance” by achieving 89% prediction accuracy.

Javaid (2010) used multiple regression model to predict market shares of companies on different parameters, namely: KIBOR, dividend, earning per share, gross domestic product, and inflation. They used Karachi-Stock-Exchange (KSE) 30-Index dataset and achieved up to 62% prediction accuracy.

As compared to the aforementioned baseline studies, we used four variables including Stock-return, volatility, volume, and return-on-investment, which are derived using a computation and pre-processing algorithm (Algorithm 1). The computation and pre-processing algorithm is used to normalize the sparse nature of data for achieving better prediction results. Table 14 shows a comparison of our method with the baseline methods.

Table 14 Comparison with the state-of-the art stock prediction methods table

We applied different machine learning techniques, such as Decision Tree, Naïve Bayes, Neural Network, SVM, and our model on Apple Inc Dataset listed on NASDAQ (https://www.nasdaq.com/symbol/aapl/historical). Table 15 shows that the proposed model has achieved best results in terms of average accuracy as compared to other models.

Table 15 Comparison with the existing stock prediction methods on Apple Inc Dataset listed on NASDAQ

To elaborate the models type 1 (precision) and type 2 errors (recall), and to compare with other models, we present the commentary as follows: the recommendation module of our system assists in making decision about whether to purchase or sell the stock. To evaluate the robustness of the model, type 1 (precision) and type 2 (recall) errors are elaborated by running the leading models and comparing the results obtained from our proposed system. Following are the formulas for computing precision, recall and accuracy.

$$Precision (p) = \frac{tp}{tp + fp}$$
$$Recall (r) = \frac{tp}{tp + fn}$$
$${\text{Accuracy}} = \frac{tp + tn}{{{\text{tp }} + {\text{fp}} + {\text{tn}} + {\text{fn}}}}$$

where tp is the number of true +ive values, fp is the number false +ive values, tn is the number of true −ive values, and fn is the number of false −ive values.

We compute aforementioned metrics using Samsung Electronic (https://www.nasdaq.com/symbol/ssnlf/historical) listed on NASDAQ (Qin et al. 2017). The results are listed in Table 16, showing effectiveness of the proposed system with respect to comparing models.

Table 16 Comparison with leading models

We discuss the generality of your model across the different stock markets. The performance evaluation results shown in Tables 17, 18 and 19 across different datasets (Samsung Electronic (https://www.nasdaq.com/symbol/ssnlf/historical, General Electric (https://www.nasdaq.com/symbol/ge/historical, and Baba (Alibaba) (https://www.nasdaq.com/symbol/baba/historical) for a trading time frame of 1 to 6 months, show that the proposed system achieved promising results in terms of precision, recall and accuracy.

Table 17 Efficiency of the proposed model across samsung electronics dataset listed on NASDAQ
Table 18 Efficiency of the proposed model across Baba (Alibaba) Dataset listed on NASDAQ
Table 19 Efficiency of the model across general electric dataset listed on NASDAQ

5 Conclusions and future work

This work presents the results of stock prediction by applying a multiple regression model using R software. It explores the stock prediction model using historical dataset. Different experiments are performed on different datasets to evaluate the accuracy of the proposed prediction model and measure the parameter relationship. The proposed technique consists of the following steps: (i) historical dataset about different stocks are obtained from a financial resource in “data acquisition” module, (ii) pre-processing is performed on the acquired dataset, by using “Computation and Pre-Processing of Stock Market Indicators” to prepare it for further processing. Various formulas are applied to compute stock change, volatility and return. To minimize data inconsistency, data cleaning is applied by fill-in missing values (handling divide by zero) technique; moreover, volume reduction is performed by computing an average of each parameter. (iii) the next step is “data splitting”, which splits the data into training and validation sets with 80:20 ratio, (iv) the multiple regression model is trained on the training data by using an “Algorithm 1”, and then the model performance is checked on the basis of validation data; model summary shows information about the proposed model, such as parameter significance, prediction accuracy, and overall model fitness, the obtained results show that when KSE 100-index dataset is used to train the model, it gives 95% prediction accuracy, and Lucky Cement yields 89% prediction accuracy, and the prediction accuracy of Engro Fertilizer Limited dataset is 97%, and (v) finally, user-friendly interface is provided to assist individuals and companies to invest or not in the specific stock.

The proposed stock market prediction model give betters prediction accuracy as compared to the state-of-the-art methods, however, different limitations are identified as follows: (i) prediction accuracy is comparatively low by using the three predictors, (ii) only monthly stock return is predicted by the proposed model, and (iii) the existing regression-based analytical approach for stock prediction has performance issues due to incorporation of only quantitative nature of data.

There are different directions for future extensions, such as (i) prediction accuracy can be increased further, if more predictors are added in the multiple regression model, (ii) the weekly and daily stock return can also be predicted instead of merely predicting the monthly stock return, (iii) to make the prediction model more robust, lagging of variables before their use in the function is required to be incorporated, and (iv) hybrid model in conjunction with quantitative and qualitative variables can be used to predict stock market trend more efficiently.