
1 Introduction

The stock market of any country is a key indicator of the country's growth and economy. A stock market is a place where publicly listed companies trade their shares to raise capital. Looking at the historical trends of the stock market, predicting stock prices is not an easily accomplished task. Therefore, predicting stock market prices can prove to be a great help to those who invest in the stock market. It also helps to gauge the country's future growth and economy, which can assist officials in framing policies for the development of the nation. Moreover, it helps the general public to understand the trends of the market, and when and how much one can invest to get the maximum returns. There are several indices of the stock market, and the BSE SENSEX is one of them. The BSE SENSEX, also called the BSE 30 or simply SENSEX, is a free-float, market-weighted index of 30 well-established and financially sound companies listed on the Bombay Stock Exchange.

Stock market trend analysis is a difficult task because of the daily ups and downs in stock prices. Hence, it is important to build an accurate and precise model for predicting stock prices. There are various approaches to analyzing stock prices, and the statistical approach is one of the most widely used [11]. Statistical analysis is the collection, exploration, and presentation of data to understand the patterns and trends (if any) present in the dataset. Furthermore, a time series approach yields a more accurate and precise prediction model [16]. A time series is data that exists over a continuous time interval, and time series analysis examines such data for a better understanding of its trends and patterns. Besides its own history, many additional factors affect the BSE SENSEX, such as Gross Domestic Product (GDP), inflation, and exchange rates such as the value of the US Dollar in Indian Rupees [1]. GDP for any country is the final value of the goods and services produced within the geographical boundaries of that country during a particular period of time. Inflation is the annual rate of increase in the prices of goods and services in a country. An exchange rate is the price of one country's currency in terms of another country's currency. Viewed mathematically, these factors move in proportion with the increase and decrease of stock market prices.

This manuscript specifically targets predicting the BSE SENSEX from its historical values [17] and from the factors affecting it. Performing univariate analysis, i.e. understanding the historical trends in the dataset, yields a model for predicting stock prices from past values. Historical data of the past 18 years was analyzed, and the best-fit model was selected according to the mean error of various forecasting models. Applying various forecasting models in combination and comparing them simultaneously gives the best output [3]. From the results and errors of the various forecasting models, an error matrix is prepared for better understanding. To increase the accuracy of the results found in the univariate analysis, the next target was multivariate analysis. For this, the correlation values between the BSE SENSEX vector and all the factors affecting the BSE SENSEX are determined, and a correlation matrix is prepared to judge the most influential factor. Multivariate analysis of the dataset then provides a mathematical relation between the highly influential factors and the stock prices, so the next target was to create a mathematical relation between the BSE SENSEX values and the additional factors affecting it. Further, an ensemble is prepared to improve the accuracy and precision of the model. In the end, all the results are compared to find the best forecasting model. The data used spans 18 years for the univariate forecasting model and 15 years for the multivariate analysis.

2 Problem Statement

The main objective covered in this manuscript is to predict the BSE SENSEX value accurately and precisely. To achieve this objective, there are three sub-objectives. The first sub-objective is to predict the BSE SENSEX value by univariate analysis, i.e. by analyzing the historical values and trends in the dataset, and to obtain a suitable forecasting model that gives the least mean error. The second sub-objective is to improve the accuracy of the model by analyzing the factors affecting the BSE SENSEX and performing multivariate analysis on the most influential factor, with a mathematical equation as the output. The third and last sub-objective is to build an ensemble.

3 Proposed Method

Figure 1 depicts the proposed methodology for building a precise and accurate prediction model.

Fig. 1. Proposed methodology

3.1 Univariate Analysis

Step 1 in preparing the prediction model is to create a forecasting model based on the previous trends in the dataset, i.e. univariate analysis. Univariate analysis is one of the simplest forms of analysis, in which only the previous or historical values of the dataset are used. 'Uni' means one and 'variate' here means variable, so one-variable analysis is known as univariate analysis; it acts as the basic step for analyzing any dataset. The first step is data collection, the process of gathering data from all the relevant sources in a systematic fashion that enables one to answer the relevant questions and evaluate outcomes [7]. The next step is data cleaning, which refers to removing invalid data points from the dataset [14]: data points disconnected from the effect and assumption under study are isolated and ignored, and the analysis is conducted on the remaining data. After data cleaning, the next step is exploratory analysis, for which the data is loaded into the statistical environment so that different statistical functions can be applied to it. The dataset is then converted into a time series, meaning the data exists over a continuous time interval with equal spacing between every two consecutive measurements. Converting the dataset into a time series is an effective method for analyzing any dataset, especially in stock analysis [2]. The next step is plotting the time series object to examine the components of the time series data, i.e. trend, seasonality, stationarity, and heteroskedasticity. Among these components, stationarity is the most important: a dataset is said to be stationary when its mean and variance are constant, i.e. its joint distribution does not change over time.
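As an illustration of these preparatory steps, a minimal sketch in R (the tool used in Sect. 4) is given below. The file name bse_sensex.csv, the column name Open, and the monthly frequency are assumptions for illustration, not values taken from the paper.

  library(forecast)   # forecasting models (Sect. 4)
  library(tseries)    # stationarity tests (Sect. 4)

  # Hypothetical input file; the BSE SENSEX dataset has Open, High, Low, Close
  sensex <- read.csv("bse_sensex.csv")

  # Data cleaning: drop rows with missing values
  sensex <- na.omit(sensex)

  # Convert the Open vector into a time series object (monthly frequency assumed)
  open_ts <- ts(sensex$Open, start = c(1997, 1), frequency = 12)

  # Plot to inspect trend, seasonality, stationarity, and heteroskedasticity
  plot(open_ts, ylab = "BSE SENSEX (Open)")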

The plot of the time series suggests whether the data is stationary, and hence whether it is volatile: non-stationary data deviates widely from the mean of the dataset and is quite unpredictable. Stationarity is then tested formally with tests such as the Ljung-Box test and the Augmented Dickey-Fuller test.

The next step involves testing for stationarity, under which two tests are performed on the dataset (a short R sketch follows the list):

  • Ljung-Box Test: The Ljung-Box test is a statistical test of whether any of a group of autocorrelations of a time series differ from zero.

  • Augmented Dickey-Fuller (ADF) Test: The ADF test is a unit root test for stationarity. Unit roots can cause unpredictable results in a time series analysis. The ADF test can be used in the presence of serial correlation; here the lag length is the important parameter for obtaining meaningful results [9].
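Assuming the open_ts object from the earlier sketch, both tests can be run as follows; the lag of 20 in the Ljung-Box call is a conventional illustrative choice, not a value from the paper.

  # ADF test (tseries package): null hypothesis = unit root, i.e. non-stationary
  adf.test(open_ts)

  # Ljung-Box test (stats package): null hypothesis = autocorrelations are jointly zero
  Box.test(open_ts, lag = 20, type = "Ljung-Box")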

Under the null hypothesis of the ADF test, the series has a unit root; hence a large p-value indicates non-stationarity and a small p-value indicates stationarity [8]. If the ADF test result is not favorable, i.e. the p-value is relatively high, further visual inspection is needed; otherwise, the next step, decomposition of the dataset, can be skipped. Decomposition breaks the dataset down into observed, trend, seasonal, and random components [5]; plotting the seasonal vector gives an indication of stationarity or non-stationarity. Model estimation comprises two phases: in the first phase, non-stationary data is transformed into stationary data, and in the second phase, a model is built. If the data is already stationary, the first phase can be skipped. First, the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) plots are prepared; these describe the correlation structure of the series and help to judge its covariance. When there is large autocorrelation among the lagged values, the time series object must be differenced to transform the series into a stationary one. Differencing the series means calculating the differences between all consecutive values of a vector; this helps to stabilize the mean, thereby making the time series object stationary. The transformed time series is then plotted to check whether it is now stationary, and the ACF and PACF plots of the differenced series are examined to clear any doubt about stationarity. Stationarity can be confirmed with the Ljung-Box and Augmented Dickey-Fuller tests, which should now give p-values that are very small in comparison to the previous ones. The next job is building the model, i.e. deducing which particular model fits the dataset best according to the statistical results. The different models are (a comparison sketch in R follows the list):

  • Autoregressive Integrated Moving Average (ARIMA) Model: ARIMA is a forecasting technique that projects the future values of a series entirely from its own inertia. Its main application is short-term forecasting requiring at least 40 historical data points. It works best when the data exhibits a stable or consistent pattern over time with a minimal number of outliers. It is a preferred choice because of its simplicity and wide acceptability, and it offers great flexibility for univariate time series [12].

  • Box-Cox Transformation: Box-Cox transformations are generally used to transform non-normally distributed data so that it becomes approximately normal.

  • Exponential Smoothing Forecast: This forecasting method relies on weighted averages of past observations where the most recent observations hold higher weight. This method is suitable for forecasting data with no trend or seasonal pattern.

  • Mean Forecast: This forecasting method relies on the mean of the historical data.

  • Naive Forecast: The naive forecasting method is equivalent to an ARIMA(0, 1, 0) random walk model applied to the time series object; each forecast equals the last observed value.

  • Seasonal Naive Forecast: This forecasting method follows almost the same principles as the naive method but performs better when the data is seasonal.

  • Neural Network: Neural networks are forecasting methods based on simple mathematical models of the brain. They allow complex nonlinear relationships between the response variable and its predictors. This model is very helpful when combined with a statistical computational approach for stock market forecasting [15].
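Assuming the open_ts series prepared earlier, the sketch below fits the listed models with the forecast package and collects their training-set errors into an error matrix; the horizon h is illustrative.

  h <- 12  # illustrative forecast horizon

  fits <- list(
    arima  = forecast(auto.arima(open_ts), h = h),
    boxcox = forecast(auto.arima(open_ts, lambda = BoxCox.lambda(open_ts)), h = h),
    ets    = forecast(ets(open_ts), h = h),        # exponential smoothing
    mean   = meanf(open_ts, h = h),
    naive  = naive(open_ts, h = h),
    snaive = snaive(open_ts, h = h),
    nnet   = forecast(nnetar(open_ts), h = h)      # neural network
  )

  # One row of error measures (ME, RMSE, MAE, ...) per model
  errors <- t(sapply(fits, function(f) accuracy(f)[1, ]))
  print(errors)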

The model with the least error, i.e. the highest accuracy, is the best-fit model for the dataset. Moreover, error analysis suggests improvements that can be made to the results in the future [6].

3.2 Multivariate Analysis

Step 2 involves multivariate analysis to improve the results of Step 1. Multivariate analysis is a statistical approach in which the dataset is analyzed on the basis of different factors, with the main objective of preparing a combined model for better performance, analysis, and accuracy. Univariate analysis is often preferred because multivariate results can be difficult to interpret. Many techniques exist for multivariate analysis, and a particular technique is chosen depending on the type of dataset. Multivariate analysis is performed with the factors that have great influence on the dependent variable, i.e. the variable whose value is to be determined by the analysis. Principal Component Regression (PCR) is the most commonly followed technique for multivariate analysis. It is based on Principal Component Analysis; PCR reduces the dimensionality of the dataset and avoids multicollinearity among the predictor variables. The results from Step 1 can be improved by carrying out a relationship analysis between the dataset vectors and the factors affecting the dataset. For relationship analysis, the statistical approach used is regression. Regression builds a model, in the form of mathematical equations, that determines the relationship of different factors with the main variable. In regression, one variable is known as the predictor variable, whose values are obtained from experiments or observations, and the other is the response variable, whose value is estimated from the predictor variable. Generally, there are seven types of regression, listed below:

  • Linear Regression: This is the most commonly known regression modeling technique. In this technique, there can be more than one independent variable, which can be either discrete or continuous, but the dependent variable must be continuous, and the regression line should be linear [13].

  • Logistic Regression: This type of regression determines the probability of an event, i.e. success or failure [10]. Logistic regression should be preferred when the dependent variable is binary, i.e. 0 or 1, true or false, yes or no.

  • Polynomial Regression: In this type of regression model, the regression equation is a polynomial, i.e. the independent variable has a power greater than 1. Here the best-fit regression line is not a straight line.

  • Stepwise Regression: This type of regression requires multiple independent variables. The selection of predictor variables is done automatically, without human intervention. Its basic aim is to produce the best-fit model with the highest possible accuracy.

  • Ridge Regression: This type of regression model is applied when the independent variables have high absolute correlation values, i.e. suffer from multicollinearity. In this model, the alpha value is set to 0.

  • Lasso Regression: This type of regression model is similar to the ridge regression model but uses absolute values instead of squares in the penalty function. In this model, the alpha value is set to 1.

  • ElasticNet Regression Model: This is a hybrid of the ridge and lasso regression models, with the alpha value set to 0.5.

The linear regression approach is preferred over the other regression approaches, as all the others are built upon an understanding of linear regression [21]. A key requirement for linear regression is linearity among the variables. Correlation values help to judge the dependability of the response variable upon a predictor variable: they range from −1 to 1, and the larger the absolute value of the correlation coefficient, the greater the mutual dependability of the variables and the stronger the linearity between them. After determining the correlation values, the most influential factor is extracted. Furthermore, model fitting is refined by applying mathematical functions, such as the logarithmic or exponential function, to both the response and predictor variables, to make model estimation simpler. Alternatively, instead of passing only the most influential factor, all the factors can be passed simultaneously as arguments to the regression algorithm; whichever model performs better is the best-fit model. The accuracy is determined in the same way in all cases, i.e. by summarizing the regression model.
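A minimal sketch of this step in R, assuming annual vectors open and gdp aligned by year (the names are illustrative):

  # Linearity check: correlation ranges from -1 to 1
  cor(open, gdp)

  # Plain linear model and a log-log variant for comparison
  fit_raw <- lm(open ~ gdp)
  fit_log <- lm(log(open) ~ log(gdp))

  # Judge the fits by their p-values and R-squared
  summary(fit_raw)
  summary(fit_log)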

3.3 Ensemble Technique

Step 3 involves building an ensemble for the dataset. An ensemble, also known as a data combiner, is a data mining approach that converges the strengths of multiple models to achieve better prediction accuracy. Basically, ensembling means combining multiple algorithms to improve the accuracy of the model; the ensemble method is one of the most influential developments in the field of data mining. Ensembles combine multiple models into one by extracting the most accurate models among them, and ensembling is performed depending upon the dataset. The necessary steps for the technique are outlined below:

Algorithm a. Steps for building the ensemble
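As used concretely in Sect. 4, the ensemble sums the factor values into a single total predictor and regresses the open vector on it. A minimal sketch, assuming annual vectors open, gdp, inflation, and usd aligned by year:

  # Combine the factors into one predictor
  total <- gdp + inflation + usd

  # Require reasonable linearity before fitting (threshold from Sect. 4)
  stopifnot(abs(cor(open, total)) > 0.5)

  # Regress the open vector on the combined total vector
  fit_ens <- lm(open ~ total)
  summary(fit_ens)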

All the prediction models can then be analyzed for their accuracy and precision. The model with the highest accuracy and precision is the best-fit model for the dataset.

4 Results and Discussions

The tool used for forecasting is R. The packages providing the functionality described in Sect. 3 are the forecast and tseries packages. The datasets used in the analysis are the BSE SENSEX, the GDP of India, USD prices in Rupees, and Inflation; the sources of these datasets are listed in Table 1.

Table 1. Sources of dataset

The BSE SENSEX dataset contains four variables: open, high, low, and close. The open variable represents the opening price of the stock market, the high variable the highest price, the low variable the lowest price, and the close variable the closing price. The results of applying the procedure described above are detailed below. In Step 1, where univariate analysis is performed, the data was first cleaned and then a time series object was created for each vector of the dataset.

Once the time series object is plotted (Fig. 2), it suggests examining the different components, i.e. trend, seasonality, stationarity, and heteroskedasticity. To test the stationarity of the dataset, the Augmented Dickey-Fuller test and the Ljung-Box test were performed on it.

Fig. 2. Plot of the time series object

The values obtained from the stationarity tests are given in Tables 2 and 3; they suggest decomposing the dataset. After analyzing the decomposition plot, the non-stationarity of the data is confirmed from the ACF and PACF plots, which indicate that the series must be differenced to transform it into a stationary series. After differencing, the above tests are applied again to cross-check stationarity; the results are given in Tables 4 and 5.
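A short sketch of this sequence in R, continuing with the open_ts object assumed in the earlier sketches:

  # Decompose into observed, trend, seasonal, and random components
  plot(decompose(open_ts))

  # ACF/PACF of the raw series; a slowly decaying ACF signals non-stationarity
  acf(open_ts)
  pacf(open_ts)

  # First difference to stabilize the mean, then re-test stationarity
  open_diff <- diff(open_ts)
  acf(open_diff)
  pacf(open_diff)
  adf.test(open_diff)
  Box.test(open_diff, lag = 20, type = "Ljung-Box")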

Table 2. Augmented Dickey-Fuller test (original series)
Table 3. Ljung-Box test (original series)
Table 4. Augmented Dickey-Fuller test (differenced series)
Table 5. Ljung-Box test (differenced series)

From Tables 4 and 5 it is clear that the series has been successfully transformed into a stationary series, as the p-value is less than 0.05. Once the series is stationary, the different model functions are applied to it, and the best-fit model is selected according to their errors. The four vectors of the dataset are analyzed separately by applying the different model functions, and an error matrix is prepared from the results.

As seen in Tables 6, 7, 8, and 9, all the models perform differently. Overall, the best results are given by the Exponential Smoothing model, as its mean error is consistent across all four vectors, i.e. Open, High, Low, and Close. The Neural Network model also works well, with the exception that the Low vector of the dataset shows a relatively high mean error compared to the other three vectors.

Table 6. Model estimation for open vector
Table 7. Model estimation for high vector
Table 8. Model estimation for low vector
Table 9. Model estimation for close vector

Further, in Step 2, multivariate analysis is performed to improve the accuracy. For this, the GDP dataset, the Inflation dataset, the Exchange Rates dataset (the value of USD prices in terms of the Indian Rupee), and the open vector of the BSE SENSEX are collected annually. The next step is to determine the linearity between the BSE SENSEX values and the GDP values, Inflation, and Exchange Rates, respectively. For that purpose, correlation coefficients are determined between the different vectors and the open vector of the dataset. Table 10 lists the correlation values of the different vectors with the open vector.
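This computation can be sketched in R as follows, with open, gdp, inflation, and usd assumed to be annual vectors aligned by year (illustrative names):

  # Correlation of each factor with the open vector, as tabulated in Table 10
  sapply(list(GDP = gdp, Inflation = inflation, USD = usd),
         function(x) cor(open, x))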

From Table 10, it is clearly observable that the GDP vector is the most highly correlated with the BSE SENSEX Open vector. The next step is therefore to build a linear regression model between the Open vector and the GDP vector, with GDP as the predictor variable and Open as the response variable. The result is a linear equation between GDP and Open, given in Eq. 1.

Table 10. Correlation coefficients
$$ Open = 12793.6*GDP - 2599 $$
(1)

To check the accuracy of the equation, the regression object is summarized. The p-values help to judge whether the model is accurately fitted: accuracy is high if each p-value in the summary is approximately 0.05 or less and the R-squared value is above 0.9 (Table 11).
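In R, these quantities can be read directly off the model summary; a sketch assuming the annual open and gdp vectors from before:

  fit <- lm(open ~ gdp)            # the model behind Eq. (1)
  s <- summary(fit)

  s$coefficients[, "Pr(>|t|)"]     # per-coefficient p-values
  s$r.squared                      # should ideally exceed 0.9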

Table 11. Summary of linear regression model object with GDP vector

To increase the accuracy of the linear regression model, mathematical functions are applied; here the logarithm is taken of both vectors, i.e. the GDP and Open vectors. Applying such a function decreases the residual values and rectifies other problems that reduce the accuracy of the model, yielding Eq. 2.

$$ \log(Open) = 1.35712\,\log(GDP) + 9.15080 $$
(2)

The summary of this regression model is given in Table 12.

Table 12. Summary of Linear Regression Model object with log (GDP) vector

For a better comparison among the models, the next step is to build a combined model, using all the factors that affect the BSE SENSEX, i.e. Inflation, Exchange Rates, and GDP, as predictor variables, with the Open vector remaining the response variable. The result is the linear equation between Open as the response variable and GDP, Inflation, and USD Value as predictor variables quoted in Eq. 3.

$$ Open = 13049.797*GDP - 126.588*Inflation + 9.607*USD \,\,Value - 2497.615 $$
(3)

To check the accuracy, the regression object is summarized, and the p-values and R-squared values, which help to interpret the accuracy, are extracted; they are quoted in Table 13.

Table 13. Summary of combined linear regression model object

To further increase the accuracy, the logarithm of Open and GDP and the double exponential of the reciprocal of Inflation can be used, yielding Eq. 4.

$$ \log(Open) = 1.318037\,\log(GDP) - 0.208758\,e^{e^{1/Inflation}} - 0.006184\,USD\,Value $$
(4)

To check the accuracy, the regression model object is summarized; Table 14 lists the p-values and R-squared values.

Table 14. Summary of combined linear regression model with improvements
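A hedged sketch of how a model of the form of Eq. (4) could be fitted with the same assumed annual vectors follows; note that lm() includes an intercept by default, whereas Eq. (4) is written without one, so the fitted coefficients would differ accordingly.

  # I() protects the double exponential of 1/inflation inside the formula
  fit_eq4 <- lm(log(open) ~ log(gdp) + I(exp(exp(1 / inflation))) + usd)
  summary(fit_eq4)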

The next step is to build the ensemble. The ensemble is built by taking the numerical total of all the factors affecting the BSE SENSEX, i.e. GDP, Inflation, and USD Value. For effective linearity, the absolute value of the correlation coefficient must first exceed 0.5; the correlation coefficient between the open vector and the total vector comes out to be 0.7751132, which suggests that a linear regression model can be constructed as an experiment to improve the accuracy. For the regression, the open vector is used as the response variable and the total vector as the predictor variable. The result is the linear equation between the open vector and the total vector represented by Eq. 5.

$$ Open = 714.2*Total - 27557.8 $$
(5)

To check the accuracy of the result, the regression object is summarized; Table 15 shows the summary of the model.

Table 15. Summary of ensemble

The next and final step is to analyze the results of all the above linear regression models and find the most suitable model for forecasting, which can closely predict the value of the BSE SENSEX. Table 16 shows the net p-values of all the above regression models, through which the results can easily be compared.

Table 16. Net p-values for all the above regression models

It is clearly observable that the Open-GDP model with the logarithmic transformation gives the least p-value when compared with all the other models.

5 Conclusion

In this manuscript, research is performed on the BSE SENSEX dataset from January 1997 to January 2016, and on the datasets of the GDP of India (in trillions), Inflation (in %), and USD values (in Rupees), interpreted annually from 2001 to 2015. Applying different forecasting models first, and then linear regression techniques, shows that every model performs differently and can be analyzed on the basis of its mean error and net p-value. From the mean errors of the forecasting models, it is concluded that Exponential Smoothing and the Neural Network give consistently low mean errors, with the small exception of the low vector, where the mean error of the neural network is comparatively high. Moreover, when the linear regression algorithm is applied to the datasets for the improvements, it is concluded that the linear regression model of the logarithmic Open values of the BSE SENSEX against the logarithmic GDP values of India gives the best accuracy and precision among all the quoted models.