1 Introduction

Stock price prediction is an important task in planning investment activities. However, building an effective stock price prediction model remains challenging, considering that stock prices are affected by multiple factors. In addition to historical prices and a series of technical indicators, the current stock price is also affected by social sentiment; the overall social mood toward a company may be one of the most significant variables affecting its stock price. With the rapid development of social media, an increasing number of investor posts are being published online, making large amounts of sentiment data available.

Many prior studies have confirmed the value of investor sentiment in stock market prediction [4, 55, 61, 63], even in the Bitcoin exchange market [87]. However, social media information consists of loosely formatted, unrestricted text that grows dynamically. This study therefore attempts to integrate as much content as possible from the social environment of the stock market to develop an effective stock prediction method that fully utilizes time series information.

Other drawbacks of previous studies include using only a snapshot of the dataset at time point t to predict another time point in the future [12, 83] and using models that were not tailored to deep sequential information [55]. This ignores the time series relationships among the consecutive trading days before time point t, which carry significant information hidden in the historical series. The LSTM network [29] is designed to learn sequential information and has been verified to be superior to other models at extracting effective information from complex financial time series data [35, 58]. Therefore, we believe it will help to improve the performance of our prediction method.

To address these questions, we take four steps: 1) we propose a fine-tuned BERT sentiment classification model and a sentiment lexicon for sentiment analysis; 2) we convert the sentiment information into a novel representation feature used as model input; 3) we build an ALSTM-based architecture to learn deep sequential information under varying input window lengths; and 4) we conduct experiments on a large collection of social media posts concerning 28 stocks over a period of three years.

This study makes four contributions: (1) we introduce an ALSTM-based architecture for stock price prediction using stock price data, technical indicators and sentiment information, which outperforms the baseline models on both the validation and test data sets under three different evaluation metrics; (2) we compare model performance across different data sources, demonstrating the real effectiveness of sentiment analysis in stock prediction; (3) we propose a fine-tuned BERT sentiment classification model that performs well on the sentiment classification task, and show that the sentiment feature computed with the BERT model also leads to higher prediction accuracy than the feature calculated from the sentiment lexicon; and (4) we compare prediction accuracy across different input window lengths and find that setting the time window to 5 days improves the average prediction performance of all proposed models. The highest average prediction accuracy over the 28 stocks is achieved when using the sentiment feature calculated by the fine-tuned BERT model.

The rest of the paper is organized as follows. Section 2 reviews related work on stock prediction based on price data and technical indicators, prediction combining sentiment analysis, and prediction using long input window lengths. Section 3 describes our proposed methodology. Section 4 presents the detailed experimental process and assesses the experimental results. Section 5 presents the discussion and implications. Finally, the last section concludes our contributions and proposes future work.

2 Related work

This section summarizes studies on (1) Domain 1: stock prediction based on price data and technical indicators, (2) Domain 2: stock prediction based on sentiment analysis, and (3) Domain 3: stock prediction based on long input window lengths. Several research gaps are identified through this summary.

2.1 Stock predictions based on price data and technical indicators

Stock market prediction has been an important task in both academia and industry. Based on the Efficient Market Hypothesis (EMH) [18], some early studies argued that, given the risk involved, it is impossible to achieve above-market returns over the long term, and therefore the prediction accuracy of the stock market cannot exceed 50% [71]. However, the EMH has been questioned ever since [31, 62], especially with the rapid development of machine learning models [5, 21, 64, 85]. A prediction accuracy of 56% is generally considered a satisfactory result [73, 77].

Despite Fama's hypothesis, there are two different trading philosophies for stock market prediction [8]: fundamental analysis and technical analysis. The former analyses macroeconomic factors and a company's financial condition, while the latter assumes that future performance is related to certain historical patterns [75] such as time-series prices. Several technical indicators have been defined to represent these patterns, including the moving average (MA) [24], exponential moving average (EMA) [37], momentum [43], Bollinger bands [23], etc.

Some researchers have made stock predictions based on historical prices only [93, 94] or on a small dataset [22]; with so few test instances, the results may be unreliable. Stock markets generate large-scale trading data every day, providing large amounts of training data for deep neural networks [47]. Fischer and Krauss [20] applied an LSTM-based model to financial time series prediction and showed that the LSTM network outperforms memory-free classification models, i.e., a random forest, a logistic regression classifier, and a deep neural net.

Studies in Table 1 cover four main aspects of work: (a) stock market selection; (b) feature selection; (c) input window length; and (d) prediction method. Each column corresponds to one aspect. For stock market selection, these studies choose a continuous period of time for training and testing. For feature selection, the inputs can be classified as price data (e.g. [28, 86]), technical indicators (e.g. [93]), or both (e.g. [54, 59]). Input window length is the length of the input vector (e.g., 3d represents a 3-day time window); the abbreviations 'm' and 'd' stand for minutes and days, and a null value means no relevant information was mentioned. Prediction methods can be classified as (1) reduced-form models, such as ARIMA (e.g. [85]) and GARCH (e.g. [25]); (2) machine learning models, including Bayesian networks (e.g. [94]), SVM (e.g. [5]) and SVR (e.g. [88]); or (3) deep learning models, such as ANN (e.g. [9]), RNN (e.g. [3]) and LSTM (e.g. [41, 58, 92]).

Table 1 Summary of studies based on price data and technical indicators

2.2 Stock predictions based on sentiment analysis

Sentiment analysis, which is mainly designed to understand what others are thinking [57], has proved effective in many applications including movie reviews [39, 40, 80], product reviews [38] and public opinion [70, 81]. Sentiment information extracted from social media has also proved effective for stock market prediction [46, 60]. Researchers have drawn on two main sources of text content for their financial models: earlier studies mainly used news [45, 67, 68], while more recent studies use social media [48]. Bollen, et al. [6] conducted the most influential study, gauging specific dimensions of Twitter sentiment to predict the Dow Jones index with improved accuracy. Since this seminal study, sentiment extracted from Twitter [52, 82], Yahoo! Finance [56], Sina Weibo [83], GuBa [48], etc. has been shown to be highly correlated with the stock market. Xing, et al. [84] note that it is insufficient for investors to invest based on public sentiment alone and that other factors must also be considered in prediction models.

There are two main approaches to sentiment analysis of text content: sentiment lexicons [15, 30] and natural language processing [1, 32]. Picasso, et al. [61] extracted two distinct sets of sentiment features from sentiment texts, based on the dictionary of Loughran and Mcdonald [50] and on AffectiveSpace2 [7] respectively. The former is a dictionary specific to financial applications, while the latter is a vector space model designed to extract sentiment from structured content. Their results show that combining sentiment with price technical indicators outperforms using price data only. Using the AffectiveSpace feature as input achieved higher accuracy, while the features calculated from the Loughran and McDonald dictionary achieved higher returns.

As shown in Table 2, these studies cover five main aspects of work: (a) stock market selection; (b) feature selection; (c) input window length; (d) sentiment analysis method; and (e) prediction method. For stock market selection, these studies also focus on a continuous period of time. For feature selection, sentiment information is added to the feature set in the form of (1) polar sentiments (e.g. [45]), (2) sub-categorical sentiments (e.g. [61, 69]), or (3) a sentiment index (e.g. [31]). For input window length, these studies also use a fixed length (e.g. [6, 48]). Sentiment analysis methods can be classified as sentiment lexicons (e.g. [82]) or natural language processing (e.g. [56]). As prediction methods, (1) machine learning models, including SVM (e.g. [47]) and SVR (e.g. [52]), and (2) deep learning models, such as LSTM (e.g. [12]) and RNN (e.g. [83]), are commonly used.

Table 2 Summary of studies based on sentiment analysis

2.3 Stock predictions based on long input window length

Stock prediction can be viewed as a time series problem when a long input window is used for model training. Given a univariate or multivariate time series, one may treat the entire series as a sample. There has been considerable interest in predicting with long input windows, and it remains an active research area [15, 91].

Nguyen, et al. [55] extract information from two consecutive days for stock movement prediction. In their study, the features of each day are treated as parallel inputs for training an SVM. Shynkevich, et al. [72] employ technical indicators to describe the past trend of the stock price. In their research, indicators are regarded as a snapshot of the current situation that also reflects past behaviour over a certain period of time. Several machine learning algorithms are trained on input features calculated from price data over different time spans. With the rapid development of computer engineering, deep learning algorithms have been widely used in financial time series modelling. Instead of using indicators calculated over different input window lengths, these studies consider higher-dimensional input data [17, 34], allowing deep learning networks to learn the hidden sequential information.

As shown in Table 3, the five main aspects of these studies are: (a) stock market selection; (b) feature selection; (c) input window length; (d) input data form; and (e) prediction method. Stock market selection and feature selection are straightforward. For input window length, these studies use a relatively long time period (e.g. [49]) or several alternative lengths for comparison (e.g. [53, 89]). The input data form can be categorized as a one-dimensional vector (e.g. [55, 72]) or a higher-dimensional vector (e.g. [42]). Among prediction methods, LSTM (e.g. [66, 79]) is the most commonly used.

Table 3 Summary of studies based on long input window length

2.4 Summary

By summarizing and comparing previous research in the above three domains, we identify three issues that warrant further investigation, as follows.

The first issue is that many previous studies make predictions using only stock price data and several technical indicators. The booming development of social media accelerates the dissemination of users' opinions and sentiments [44]. Investors tend to seek emotional support [19], making the impact of sentiment more significant than usual. Hence, sentiment analysis of social media posts carries greater significance for the stock prediction task.

The second issue is that existing sentiment analysis approaches lack an in-depth understanding of the sentiment text content. Some semantics-based methods use a sentiment lexicon to analyse sentiment. However, since the sentiment of the whole content is judged from a limited set of keywords, the deeper sentiment in the text may be neglected due to the imperfection of the lexicon. An efficient method is needed to extract this deeper sentiment. We therefore utilize BERT [16] in our sentiment analysis process, inasmuch as it has yielded better results on many NLP tasks including sentiment classification.

The last issue is that previous studies fail to explore the impact of the input window length on prediction performance. Although many previous studies feed models a long input window, the window length is usually fixed [45, 90], or the input data form lacks time series information [55]. Changing the input window length may also change prediction performance, but this is seldom considered. Hence, it is important to examine how prediction performance differs across input window lengths.

To settle these three issues, this study builds a prediction model based on ALSTM networks using three data sources as input: price data, technical indicators and a sentiment feature. The sentiment feature is extracted from social media posts through two different sentiment analysis methods for comparison: a manually predefined sentiment polarity lexicon for the financial field, and a fine-tuned BERT sentiment classification model. Input windows of different lengths are fed to the ALSTM networks for comparison. To our knowledge, this paper is one of the earliest attempts to reveal the impact of sentiment analysis across different window lengths for stock price prediction.

3 Methodology

An overview of the research framework is shown in Fig. 1. First, the sentiment posts are analysed and a sentiment indicator for each transaction day is calculated. Then the sentiment indicators, combined with the time series of stock prices and technical indicators, are organized as model input. By learning the past N days' features, the closing price of day N + 1 is predicted. Details of each part are explained in the following subsections.

Fig. 1 Illustration of research framework

3.1 Price and technical indicators

In this study, 6 stock price indicators and 8 technical indicators are selected to construct the indicator set.

The stock price data comprise the open, close, high and low prices, the turnover rate and the trading volume. Technical indicators are widely used for analysing market states [3]. Therefore, besides historical prices, we also employ several technical indicators, shown in Table 4, as extra inputs to the ALSTM networks. These indicators reflect stock trends from multiple aspects, providing rich market signals for the ALSTM networks to learn. However, these technical indicators may not have exact values on every single day because of their different time configurations. Therefore, transaction days with missing values are removed to ensure the integrity of the time series data.

Table 4 Meanings of technical indicators
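As a concrete illustration of these inputs, the sketch below computes several common Table 4 indicators with pandas and drops days with undefined values, as described above. The column name and the 10- and 20-day window lengths are illustrative assumptions, not necessarily the configurations used in our experiments.

```python
# A sketch of a few technical indicators computed with pandas; window lengths
# and the 'close' column name are illustrative assumptions.
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """df must contain a 'close' column of daily closing prices."""
    out = df.copy()
    out["MA_10"] = out["close"].rolling(10).mean()                  # moving average
    out["EMA_10"] = out["close"].ewm(span=10, adjust=False).mean()  # exponential MA
    out["momentum_10"] = out["close"] - out["close"].shift(10)      # momentum
    mid = out["close"].rolling(20).mean()                           # Bollinger bands
    std = out["close"].rolling(20).std()
    out["boll_upper"], out["boll_lower"] = mid + 2 * std, mid - 2 * std
    # drop transaction days where indicators are undefined, as described above
    return out.dropna()
```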

3.2 Sentiment analysis

The sentiment analysis module in Fig. 1 classifies sentiment posts into three categories (positive, negative, and neutral) according to the beliefs or expectations expressed. A positive post means that the mentioned stock price is expected to rise in the near future, or shows the poster's inclination to buy the stock; a negative post indicates an expectation of a price fall or an inclination to sell; and a neutral post expresses no obvious expectation or recommendation, with no trading inclination. These user-generated text contents are processed by two sentiment analysis methods for comparison: a manually constructed sentiment lexicon and a fine-tuned BERT model for sentiment classification.

3.2.1 Sentiment lexicon

Sentiment dictionaries have been widely used to transform sentimental content into representations. In this experiment, the National Taiwan University Sentiment Dictionary (NTUSD) was used as the base lexicon, and extra finance-related terms were added manually. These rise/fall-related terms were summarized from online posts and relevant studies to make up for the lack of relevance between the original lexicon and the stock market. The new lexicon contains two polar sentiments, positive and negative; words that do not exist in the lexicon are assigned to the third sentiment dimension, neutral. Three natural language processing steps are applied to the online posts. The first step is Chinese word segmentation and unwanted word removal: unwanted words such as stop words and special characters (@, #, $, etc.) play no role in classification, and this step yields the text sequence of each post. The second step is sentiment word matching: the text sequences are matched against our sentiment lexicon, marking words with the tags "positive", "negative" and "neutral". The third step is post sentiment calculation: the sentiment polarity of post j is calculated through Eqs. (1)–(4).

$$ {PosCount}_j=\sum_{i=1}^{T}{Pos}_{\left(i,j\right)} $$
(1)
$$ {NegCount}_j=\sum_{i=1}^{T}{Neg}_{\left(i,j\right)} $$
(2)

where j indexes the posts and i indexes the ith word in the text sequence. Pos(i, j) and Neg(i, j) indicate whether the ith word is positive or negative, respectively: a word appearing in the positive part of our lexicon is counted toward PosCountj, the total number of positive words, and a word appearing in the negative part is counted toward NegCountj, the total number of negative words. In this study, PosCountj and NegCountj represent the strength of the expectations of a rise and a fall, respectively.

$$ {D}_j={PosCount}_j-{NegCount}_j $$
(3)
$$ {Sent}_j=\begin{cases}\mathrm{Positive} & \mathrm{if}\ {D}_j>0\\ \mathrm{Neutral} & \mathrm{if}\ {D}_j=0\\ \mathrm{Negative} & \mathrm{if}\ {D}_j<0\end{cases} $$
(4)

Through Eqs. (3) and (4), the magnitudes of PosCountj and NegCountj are compared. When PosCountj is larger than NegCountj, the post expresses a stronger expectation that the stock price will rise, and vice versa. Dj is computed from PosCountj and NegCountj to classify post j as positive, negative or neutral. These sentiment polarity labels are used to construct the sentiment indicators.
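A minimal sketch of this three-step pipeline is shown below. The jieba segmenter and the example lexicon entries are illustrative assumptions; the actual lexicon is the extended NTUSD described above.

```python
# A minimal sketch of the lexicon-based classification (Eqs. 1-4); jieba and
# the placeholder lexicon entries are assumptions for illustration.
import jieba

positive_words = {"利好", "看涨", "买入"}   # placeholder positive entries
negative_words = {"利空", "看跌", "卖出"}   # placeholder negative entries
stop_tokens = {"@", "#", "$"}              # unwanted tokens removed in step one

def classify_post(text: str) -> str:
    """Return 'Positive', 'Negative' or 'Neutral' for one GuBa post."""
    tokens = [t for t in jieba.cut(text) if t not in stop_tokens]  # step 1
    pos_count = sum(t in positive_words for t in tokens)           # step 2, Eq. (1)
    neg_count = sum(t in negative_words for t in tokens)           # step 2, Eq. (2)
    d = pos_count - neg_count                                      # step 3, Eq. (3)
    if d > 0:                                                      # Eq. (4)
        return "Positive"
    return "Negative" if d < 0 else "Neutral"
```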

3.2.2 BERT-based sentiment classifier

Besides the sentiment lexicon, we also employ BERT, a pre-trained language model based on deep bidirectional Transformers [78], to perform the sentiment classification task. We fine-tune BERT for sentence-level sentiment classification, as it has produced state-of-the-art results on many NLP tasks [26]. The output of this multi-class, single-label sentiment classifier is the predicted probability of each class, and we take the final predicted category (positive, negative or neutral) from the output probabilities.

A natural idea for fine-tuning is to further pre-train BERT with target domain data [74], since BERT was trained on a general domain. In this study, we directly fine-tune the pre-trained BERT model with a task-specific dataset constructed from randomly selected posts in the GuBa dataset. The sentiment polarity of each text was manually labelled as follows. First, we unified the sentiment annotation guideline for the financial field. Second, a group of five coders completed the first round of sentiment annotation, and another group of five coders completed a second round on the same text content. Inconsistent annotations were resolved by a five-coder verification team in a final discussion. The labelled data were then used in the task-specific fine-tuning process. In this way, we reduce the limitations of the model and endow it with rich sentiment knowledge.

3.2.3 Construction of sentiment indicators

Sentiment indicators are constructed via the sentiment indicator construction method in Fig. 1, based on the sentiment classification results. Following [2, 10, 33], we adopt the bullishness indicator, defined in Eq. (5),

$$ {B}_t=\frac{M_t^{pos}-{M}_t^{neg}}{M_t^{pos}+{M}_t^{neg}} $$
(5)

where \( {M}_t^c={\sum}_{i\in D(t)}{w}_i{x}_i^c \) is the weighted sum of messages of type c ∈ {pos, neg, neu} in the time interval D(t), \( {x}_i^c \) equals one when post i is of type c and zero otherwise, and wi is the weight of post i. Antweiler and Frank [2] show that alternative weighting schemes make no difference to their conclusions and employ equal weighting; we therefore also treat \( {M}_t^c \) as the number of posts of each category. Antweiler and Frank [2] propose another bullishness indicator, shown in Eq. (6):

$$ {B}_t^{\ast }=\ln \left[\frac{1+{M}_t^{pos}}{1+{M}_t^{neg}}\right] $$
(6)

To reflect the number of investors expressing a given sentiment, they provide an alternative approximation, shown in Eq. (7):

$$ {B}_t^{\ast}\approx {B}_t\ln \left(1+\left({M}_t^{pos}+{M}_t^{neg}\right)\right) $$
(7)

The second measure, \( {B}_t^{\ast } \), outperforms the first in their research. However, neutral posts are not considered in these bullishness indicators, even though they also reflect investors' attention to a particular stock despite containing no obvious expectation or belief. To capture investor attention more comprehensively, we propose the investor sentiment indicator \( {B}_t^{all} \) shown in Eq. (8),

$$ {B}_t^{all}={B}_t\ln \left(1+{M}_t\right) $$
(8)

where Mt is the total number of posts in the time interval D(t). Mt changes with investor attention and is not influenced by the sentiment classification method.
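The sketch below computes all of the indicators in Eqs. (5)–(8) from one day's post counts, using equal post weights as in Antweiler and Frank [2]; the guard for days without polar posts is our own assumption.

```python
# Daily sentiment indicators from post counts (Eqs. 5-8), with equal post weights.
import math

def bullishness(pos: int, neg: int, total: int) -> dict:
    b = (pos - neg) / max(pos + neg, 1)          # Eq. (5), guarded for zero polar posts
    b_star = math.log((1 + pos) / (1 + neg))     # Eq. (6)
    b_star_approx = b * math.log(1 + pos + neg)  # Eq. (7)
    b_all = b * math.log(1 + total)              # Eq. (8): total includes neutral posts
    return {"B": b, "B*": b_star, "B*~": b_star_approx, "B_all": b_all}

# e.g. a day with 120 positive, 45 negative and 60 neutral posts:
print(bullishness(120, 45, 225))
```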

3.3 Attention-based LSTM networks

In this study, attention-based LSTM networks are chosen as the prediction model. The LSTM has an architecture similar to the Recurrent Neural Network (RNN), which can learn temporal patterns from sequential data through internal loops. However, the weights of an RNN are learned by backpropagation, which has difficulty retaining long-term information and may suffer from vanishing (or exploding) gradients. LSTM models were proposed to solve these problems [29]; the key difference is that an LSTM cell contains three additional gates.

These gates determine whether data can pass through them and enable LSTM networks to learn long-term dependencies. The three gates are the input gate, the forget gate, and the output gate. The input gate indicates whether new information can be added to the LSTM memory, the forget gate decides what information should be abandoned, and the output gate controls whether to output the state. The full computation is given by the following formulas:

$$ {f}_t=\sigma \left({W}_f\left[{h}_{t-1},{x}_t\right]+{b}_f\right) $$
(9)
$$ {i}_t=\sigma \left({W}_i\left[{h}_{t-1},{x}_t\right]+{b}_i\right) $$
(10)
$$ {\tilde{C}}_t=\tanh \left({W}_c\left[{h}_{t-1},{x}_t\right]+{b}_c\right) $$
(11)
$$ {C}_t={f}_t\ast {C}_{t-1}+{i}_t\ast {\tilde{C}}_t $$
(12)
$$ {o}_t=\sigma \left({W}_o\left[{h}_{t-1},{x}_t\right]+{b}_o\right) $$
(13)
$$ {h}_t={o}_t\ast \tanh \left({C}_t\right) $$
(14)

where Wf, Wi, Wc, Wo are weight matrices, bf, bi, bc, bo are bias vectors, ht is the memory cell output at time t, σ determines how much data to keep, ft is the value of the forget gate, it gives the values of the input gate, \( {\tilde{C}}_t \) is the candidate state at time t, Ct is the current cell state, and ot is the output gate. An LSTM model is composed of such memory blocks and is capable of learning longer temporal patterns.

An attention mechanism is added to the LSTM networks to adaptively assign different attention weights to different time steps. After the hidden layer forms the feature vector H = {h1, h2, …, hT}, the attention mechanism computes an attention weight αi for each hi as follows:

$$ {e}_i=\tanh \left({W}_h{h}_i+{b}_h\right),\quad {e}_i\in \left[-1,1\right] $$
(15)
$$ {\alpha}_i=\frac{\exp \left({e}_i\right)}{\sum_{k=1}^{T}\exp \left({e}_k\right)},\quad {\sum}_{i=1}^{T}{\alpha}_i=1 $$
(16)

where Wh is the weight matrix of hi. The output of the attention mechanism can be obtained as:

$$ \left[{h}_1^{\ast },{h}_2^{\ast },\dots, {h}_T^{\ast}\right]=\left[{h}_1,{h}_2,\dots, {h}_T\right]\ast \left[{\alpha}_1,{\alpha}_2,\dots, {\alpha}_T\right] $$
(17)

where the ∗ operation denotes componentwise multiplication; that is, \( {h}_j^{\ast }={h}_j\ast {\alpha}_j,j=1,2,\dots, T \).
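A Keras sketch of this attention layer on top of an LSTM is given below. The layer follows Eqs. (15)–(17) with the softmax of Eq. (16) taken over time steps; the hidden size of 64 and the single-output regression head are illustrative assumptions rather than our exact configuration.

```python
# A sketch of the attention layer of Eqs. (15)-(17) over LSTM hidden states.
import tensorflow as tf
from tensorflow.keras import layers, models

class TemporalAttention(layers.Layer):
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W_h = self.add_weight(shape=(d, 1), initializer="glorot_uniform", name="W_h")
        self.b_h = self.add_weight(shape=(1,), initializer="zeros", name="b_h")

    def call(self, h):                                  # h: (batch, T, d)
        e = tf.tanh(tf.matmul(h, self.W_h) + self.b_h)  # Eq. (15)
        alpha = tf.nn.softmax(e, axis=1)                # Eq. (16), weights over time
        return h * alpha                                # Eq. (17), componentwise

def build_alstm(window: int, n_features: int, dropout: float = 0.35):
    inp = layers.Input(shape=(window, n_features))
    h = layers.LSTM(64, return_sequences=True)(inp)     # Eqs. (9)-(14)
    h = layers.Flatten()(TemporalAttention()(h))
    h = layers.Dropout(dropout)(h)
    out = layers.Dense(1)(h)                            # next-day closing price
    return models.Model(inp, out)
```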

4 Experiments

4.1 Dataset

Two datasets are employed in the stock price prediction process: a stock price and technical indicator dataset, and a sentiment information dataset. Stock prices and technical indicators come from the RESSET Financial Database (www.resset.com), while the sentiment information comes from GuBa (http://guba.eastmoney.com).

4.1.1 Stock price and technical indicator dataset

All 28 pharmaceutical stocks in the CSI 300 are chosen for the experiments. Historical stock prices and technical indicators were collected over a period of three years (from November 18, 2016 to November 18, 2019). The stock codes and company names are shown in Table 5.

Table 5 Stock codes and company names

There are three reasons for choosing the 28 pharmaceutical stocks in the CSI 300:

  1. CSI 300 stocks have larger market capitalizations than other stocks in the A-share market, which means there is more discussion on GuBa.

  2. Negative news about pharmaceutical and biological companies continues to emerge and has drawn increasing attention in Chinese society, such as the fraud case of DEEJ and the expired honey case of Tongrentang Chinese Medicine.

  3. Choosing stocks in the same industry reduces the negative impact of industry factors on stock price prediction.

4.1.2 GuBa dataset

To construct the sentiment indicators, expectations and beliefs need to be extracted from online posts. Text content for the 28 stocks was collected from GuBa over the same three-year period to build our sentiment information dataset. GuBa is the most representative internet stock message board in China, where investors share company news, stock price movement predictions, facts, and comments (usually with strong emotional tendencies) on specific company events. Each stock has its own GuBa page from which the stock-related posts can be easily accessed. Two examples of GuBa posts published by investors during the three-year period are shown in Fig. 2: the first shows obvious negative sentiment, and the second shows strong optimism about the stock's future price trend.

Fig. 2 Two GuBa posts published by investors

The stock market is closed on weekends and holidays. Posts published between 2:40 pm of the previous transaction date and 2:40 pm of the current transaction date are assigned to the current transaction date; for periods spanning more than 24 hours, the posts are divided by the number of days covered. Each stock has transaction dates for the full three-year period in our dataset.

However, as with other sentiment information sources, posts on GuBa are messy: post content varies in length and is riddled with spelling mistakes, uncommon expressions, redundant HTML links and irrelevant information. Table 6 tabulates, for each stock after clean-up pre-processing, the minimum, median, mean, maximum and total number of posts per transaction date. Over the three-year period, we accumulated a total of 1,451,272 posts.

Table 6 Statistics of each transaction date

4.2 Baseline setup

In the experiment, Support Vector Regression (SVR) and recurrent neural networks (RNN) are used as baselines.

4.2.1 Support vector regression

First designed by Cortes and Vapnik [14] as a classifier, SVR captures nonlinear relationships and has a global optimum. Previous studies have reported the effectiveness of SVR in financial time series forecasting problems [27, 64].

In a regression task, given a time-series data set \( F={\left\{\left({\mathbf{x}}_k,{y}_k\right)\right\}}_{k=1}^n \) derived from an unknown function y = g(x), we need to determine a function y = f(x) based on F that minimizes the difference between f and the unknown function g. The main idea of SVR is to build a mapping x → ϕ(x) into a new feature space X according to a mapping scheme. The nonlinear relationship is thereby transformed into a linear relationship between the new feature ϕ(x) and the label y in the newly created space. The SVR model can be written as

$$ y=f\left(\mathbf{x},\boldsymbol{\upalpha}, b\right)=\sum \limits_k{\alpha}_k{y}_kK\left({\mathbf{x}}_k,\mathbf{x}\right)+b $$
(18)

where xk are the support vectors in the data set F and yk are the corresponding labels. K(xk, x) = ϕ(xk) · ϕ(x) is the kernel function, where "·" is the inner product in the feature space X. Learning on the given data set F amounts to finding the support vectors and determining the parameters α and b. No explicit calculation of the new feature ϕ(x) is required, since a kernel function is used in both training and forecasting. The most widely used kernel is the radial basis function (RBF) with width σ, shown in Eq. (19):

$$ K\left(\mathbf{x},\mathbf{y}\right)=\exp \left(-{\left\Vert \mathbf{x}-\mathbf{y}\right\Vert}^2/2{\sigma}^2\right) $$
(19)

A grid search with cross-validation is employed to obtain the optimal model; the parameter grid consists of the penalty C = {0.1, 1, 2, 5, 10} and the kernel parameter gamma = {0.01, 0.1, 0.2, 0.5, 0.8}.
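A sketch of this baseline with scikit-learn is shown below; `X_train` (samples by features) and `y_train` (closing prices) are assumed to be prepared elsewhere.

```python
# A sketch of the SVR baseline with RBF kernel and the grid described above.
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

param_grid = {"C": [0.1, 1, 2, 5, 10],
              "gamma": [0.01, 0.1, 0.2, 0.5, 0.8]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=TimeSeriesSplit(n_splits=5),
                      scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train); the best model is in search.best_estimator_
```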

4.2.2 Recurrent neural networks

Recurrent neural networks (RNN) [51] are widely employed in stock market prediction [11]. An RNN is a type of neural network in which the connections between computing units form a directed cycle. The same computation is performed for every element of a sequence, with each output depending on the previous computations.

In our RNN model, the input value of the tth day xt = (xt, 1, ⋯, xt, m) is processed through the following equations,

$$ {h}_t=\tanh \left(U{x}_t+W{h}_{t-1}+b\right) $$
(20)
$$ {o}_t=\tanh \left(V{h}_t+c\right) $$
(21)

where ht is the hidden state, calculated from the previous hidden state ht − 1 and the input xt at the current time step; ot is the predicted output value, which refers to the closing price in this study; and U, W and V are the input-to-hidden, hidden-to-hidden and hidden-to-output parameters, respectively.
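A minimal Keras reading of Eqs. (20)–(21) is sketched below; the hidden size is an assumption, and the tanh output head presumes closing prices scaled to [-1, 1].

```python
# A sketch of the RNN baseline (Eqs. 20-21); layer sizes are assumptions.
from tensorflow.keras import layers, models

def build_rnn(window: int, n_features: int, dropout: float = 0.35):
    inp = layers.Input(shape=(window, n_features))
    h = layers.SimpleRNN(64, activation="tanh")(inp)  # h_t = tanh(U x_t + W h_{t-1} + b)
    h = layers.Dropout(dropout)(h)
    out = layers.Dense(1, activation="tanh")(h)       # o_t = tanh(V h_t + c)
    return models.Model(inp, out)
```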

A grid-search and cross-validation process is also employed, and the parameter grid consists of dropout rate d = {0.1, 0.35, 0.5} and batch size b = {10, 100, 200, 400}.

4.3 ALSTM setup

In the experiment, three techniques are applied to ALSTM training. First, we use root mean square propagation (RMSprop) [76], a mini-batch version of rprop, as the optimizer, since it is "usually a good choice for recurrent neural networks" [13]. The initial learning rate is set to 0.001 as recommended by the default settings. A higher initial learning rate reduces the time required for early-stage optimization but makes the optimum harder to reach, restricting model performance; conversely, a lower initial learning rate requires more training epochs but reaches a better optimum. Therefore, a decay mechanism is adopted that halves the learning rate when the loss does not decrease for 5 consecutive iterations.

Second, an early-stopping mechanism is employed to stop training automatically and further reduce the risk of overfitting. The maximum number of training epochs is set to 1000. When the training loss can no longer be improved after several iterations, further training becomes unnecessary; when the loss does not decrease for 20 consecutive epochs, the model with the lowest loss is saved and is assumed to have the best generalization ability.

Third, grid search with cross-validation is also employed. The grid consists of two hyper-parameters, each with several candidate values:

  • Dropout rate = {0.1, 0.35, 0.5}: The dropout rate of dropout layers.

  • Batch size = {10, 100, 200, 400}: The number of samples selected for training at a time.
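The sketch below wires these settings together in Keras, reusing the `build_alstm` sketch from Section 3.3; the feature count of 15 (6 price indicators, 8 technical indicators and 1 sentiment indicator) is an assumption for illustration.

```python
# A sketch of the training configuration above: RMSprop at 0.001, halving the
# learning rate after 5 stale epochs, early stopping after 20 stale epochs.
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import RMSprop

model = build_alstm(window=5, n_features=15)  # n_features = 15 is an assumption
model.compile(optimizer=RMSprop(learning_rate=0.001), loss="mse")

callbacks = [
    ReduceLROnPlateau(monitor="loss", factor=0.5, patience=5),
    EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
]
# model.fit(X_train, y_train, epochs=1000, batch_size=100, callbacks=callbacks)
```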

4.4 BERT setup

In this study, the pre-trained language model BERT-base, which contains 12 Transformer blocks, 12 self-attention heads and a hidden size of 768, is employed as the encoder. BERT maps the input sequence to a sequence representation. A special token [CLS], which contains the classification embedding, is always placed at the beginning of the sentence. In sentiment classification tasks, the whole sequence is represented by the final hidden state h of this first token. A softmax layer is employed to predict the probability of label c:

$$ p\left(c\mid h\right)=\mathrm{softmax}\left(Wh\right) $$
(22)

where W is the task-specific parameter matrix. Parameters are fine-tuned by maximizing the probability of the correct label.

The task-specific classification parameters are randomly initialized, while most hyper-parameters are kept the same as in pre-training, except for the batch size and learning rate. To avoid overfitting, a dropout rate of 0.1 is always applied to the dense layer. For model training, we use the Adam optimizer [36] with the number of epochs set to 3 and a maximum sequence length of 32. Since the optimal parameter values are usually task-specific, we employ a grid search to find them; the following candidate values were found to work well across all tasks:

  • Batch size = {16, 32}

  • Learning rate = {5e-5, 3e-5, 2e-5}

In this study, 100,000 GuBa posts are selected for fine-tuning: 90% of them are used for fine-tuning to find the best parameter set, and the remaining 10% for evaluation.
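A minimal fine-tuning sketch under these settings is given below with the HuggingFace transformers library; the Chinese checkpoint name `bert-base-chinese` and the mini-batch helper are assumptions for illustration.

```python
# A minimal fine-tuning sketch (3 labels, Adam, max length 32); the checkpoint
# name and batching helper are assumptions, not the paper's exact setup.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def train_step(texts: list, labels: list) -> float:
    """One mini-batch step over posts and their 0/1/2 sentiment labels."""
    enc = tokenizer(texts, truncation=True, max_length=32,
                    padding="max_length", return_tensors="pt")
    out = model(**enc, labels=torch.tensor(labels))  # classifies via the [CLS] state
    out.loss.backward()                              # maximize correct-label probability
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```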

4.5 Experiment setup

We conduct extensive comparative experiments on the 28 selected stocks to evaluate the predictive performance of the ALSTM networks, with SVR and RNN as baseline models. The dataset spans 18 November 2016 to 18 November 2019. The data from 18 November 2016 to 1 June 2019 (about 85% of the data) are used for training and for cross-validation to select the optimal hyper-parameters, and the data from 1 June 2019 to 18 November 2019 (the last 15%) are used for testing to evaluate out-of-sample performance.

Following Ratto, et al. [65], we adopt the "walk forward testing" method in the cross-validation process. To make maximal use of the available data, an increasing window is used to run a 5-fold time-split cross-validation: the first k folds of the time series are used for training and the (k + 1)th fold for validation. The cross-validation process is shown in Fig. 3.

Fig. 3 Cross-validation process of 5-fold time series splitting method
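A sketch of this expanding-window split with scikit-learn's `TimeSeriesSplit` is shown below; the number of training days is illustrative.

```python
# Expanding-window 5-fold split: fold k trains on all earlier folds and
# validates on the next one.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_days = 600  # illustrative length of the training span
for k, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(np.arange(n_days)), 1):
    print(f"fold {k}: train days 0-{train_idx[-1]}, validate days {val_idx[0]}-{val_idx[-1]}")
```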

To analyse the performance of each model, RMSE, MAE and accuracy are used as evaluation metrics. RMSE and MAE provide excellent error measures and are widely used in model evaluation. Accuracy evaluates the consistency of the price movement direction between the real and predicted values.

Given a set of time series observation values and the corresponding predictions, RMSE and MAE are defined as follows,

$$ RMSE=\sqrt{\frac{1}{N}\sum_{t=1}^{N}{\left({r}_{t+1}-{\hat{r}}_{t+1}\right)}^2} $$
(23)
$$ MAE=\frac{1}{N}\sum_{t=1}^{N}\left|{r}_{t+1}-{\hat{r}}_{t+1}\right| $$
(24)

where rt + 1 and \( {\hat{r}}_{t+1} \) denote the actual and predicted closing prices at time t + 1, respectively. RMSE is used as the evaluation metric to find the best parameter set for each model. Each transaction date is labelled (up or down) by comparing the closing prices of two consecutive days. Accuracy is calculated by comparing the real trend with the predicted trend, and is defined as follows,

$$ accuracy=\frac{tu+td}{tu+td+fu+fd} $$
(25)

where:

  • tu: the number of samples correctly classified as uptrend.

  • td: the number of samples correctly classified as downtrend.

  • fu: the number of samples incorrectly classified as uptrend.

  • fd: the number of samples incorrectly classified as downtrend.
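As a concrete reading of Eqs. (23)–(25), the sketch below computes all three metrics from series of actual and predicted closing prices; counting an unchanged price as a miss is our own assumption.

```python
# RMSE, MAE (Eqs. 23-24) and directional accuracy (Eq. 25).
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))        # Eq. (23)
    mae = np.mean(np.abs(err))               # Eq. (24)
    real_dir = np.sign(np.diff(y_true))      # up/down label per transaction date
    pred_dir = np.sign(np.diff(y_pred))
    accuracy = np.mean(real_dir == pred_dir) # Eq. (25)
    return {"RMSE": rmse, "MAE": mae, "accuracy": accuracy}
```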

The purpose of this study is to use the stock prices, technical indicators and GuBa sentiment of day t to predict the closing price of day t + 1. For the RNN and ALSTM models, we also combine the past N days' features for training, where N is 3, 5, 7, 10, 15 or 30. This series of comparative experiments is designed to learn the sequential information and discover the best input window length for stock price prediction. The input data are represented as a matrix of daily feature vectors, defined as:

$$ X=\begin{pmatrix}{X}_1=\left({x}_{1,1},{x}_{1,2},\dots, {x}_{1,n}\right)\\ \vdots \\ {X}_N=\left({x}_{N,1},{x}_{N,2},\dots, {x}_{N,n}\right)\end{pmatrix} $$
(26)

This matrix means that each training input contains N days of stock data, and each day consists of n features. The temporal information of the historical N trading days is thereby modelled and fed to the network as a sequence. As shown in Fig. 4, a sliding time window is applied to extract the features and labels; the window moves forward one step at a time until the end of the time series. By learning the historical data of the previous N days, the closing price of day N + 1 is predicted.

Fig. 4 Structure of one-step-ahead sliding time windows
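The sliding-window construction of Fig. 4 and Eq. (26) can be sketched as follows; the function and variable names are our own.

```python
# One-step-ahead sliding windows: each sample holds N consecutive days of
# n features, labelled with the closing price of the following day.
import numpy as np

def make_windows(features: np.ndarray, close: np.ndarray, n_days: int):
    """features: (T, n) daily feature matrix; close: (T,) closing prices."""
    X, y = [], []
    for t in range(len(features) - n_days):
        X.append(features[t:t + n_days])  # days t .. t+N-1, one row of Eq. (26)
        y.append(close[t + n_days])       # closing price of day t+N
    return np.array(X), np.array(y)

# e.g. X, y = make_windows(daily_features, closing_prices, n_days=5)
```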

4.6 Experiment results

Table 7 compares the sentiment classification accuracy of the sentiment lexicon and the fine-tuned BERT model, reporting overall accuracy as well as the accuracy on positive, negative and neutral posts. Our BERT-based method achieves better performance for all three sentiment tendencies on the test set: its overall accuracy reaches 85.9%, 22.0 percentage points higher than the sentiment lexicon method.

Table 7 Accuracy of sentiment classification of GuBa posts on test set

Table 8 tabulates the cross-validation results based on the different sentiment classification methods, with the smallest RMSE marked in bold. The ALSTM model has the best performance in most cases under either sentiment classification method. Among the 28 stocks, the RNN obtains the best fit for only 1 stock with the fine-tuned BERT sentiment classification and for 9 stocks with the sentiment lexicon. The SVR performs worst of the three models.

Table 8 RMSE of models on validation sets

The test-set results are shown in Table 9 for fine-tuned BERT and in Table 10 for the sentiment lexicon. The smallest MAE and RMSE scores for each stock are marked in bold, and the highest accuracy score is underlined.

Table 9 MAE, RMSE and Accuracy of BERT based models on test sets
Table 10 MAE, RMSE and Accuracy of sentiment lexicon based models on test sets

Based on fine-tuned BERT (Table 9), the ALSTM model outperforms the baselines on 21 stocks under MAE, 20 under RMSE and 23 under accuracy; the RNN performs best on 7 stocks under MAE, 8 under RMSE and 4 under accuracy; and the SVR performs best on 1 stock under accuracy. The ALSTM thus clearly outperforms the RNN and the SVR (64:15:1). Based on the sentiment lexicon (Table 10), the ALSTM outperforms the baselines on 20 stocks under MAE, 19 under RMSE and 21 under accuracy; the RNN performs best on 8 stocks under MAE, 9 under RMSE and 1 under accuracy; and the SVR performs best on 6 stocks under accuracy. In summary, the ALSTM outperforms the RNN and the SVR (60:18:6). Comparing the results across the two sentiment classification methods, the ALSTM obtains the best performance, the RNN the second best, and the SVR the worst.

The average accuracy over the 28 stocks for different input window lengths is given in Table 11 for easy comparison. With the input window length set to 5 days, the ALSTM model using the fine-tuned BERT sentiment classification method achieves the highest accuracy: the average accuracy over the 28 stocks reaches 61.24%.

Table 11 Average accuracy of different models using different input window length

4.7 Discussions on experimental results

4.7.1 The effectiveness of integrating sentiments

To assess the effectiveness of integrating sentiment into stock prediction, we use Δs to represent the change in accuracy between the results with and without the sentiment feature. Δs is calculated by,

$$ {\varDelta}_s=\frac{Acc_{all}-{Acc}_p}{Acc_p} $$
(27)

where Accall is the accuracy of the ALSTM model using both price and sentiment data and Accp is the accuracy using price data only. The improvements under the two sentiment classification methods are shown in Fig. 5. Combining price data and sentiment clearly outperforms using price data exclusively for most stocks. On closer comparison, most of the improvements brought by the sentiment lexicon are below 15%, while the fine-tuned BERT method improves prediction accuracy to a greater extent, with some improvements exceeding 15%.

Fig. 5 The Δs of each stock, where the x axis represents stock codes

4.7.2 The effectiveness of using multiple information sources

We next verify whether multiple information sources improve predictive performance, or whether sentiment information alone suffices and additional statistical measures are unnecessary. To this end, we use Δp to evaluate the difference in accuracy between the ALSTM models with and without price data. Δp is calculated by,

$$ {\varDelta}_p=\frac{Acc_{all}-{Acc}_s}{Acc_s} $$
(28)

where Accs represents the prediction accuracy based on the sentiment feature only. The results for Δp are shown in Table 12. Using multiple information sources clearly outperforms using the sentiment source alone in all cases.

Table 12 The prediction of the ALSTM model using different data sources

4.7.3 The effectiveness of using long input window length

To investigate whether increasing the input window length helps the models extract more time series information and improve predictive performance, we use ΔT to represent the change in accuracy between N time steps and 1 time step, where N is 3, 5, 7, 10, 15 or 30. ΔT is calculated by,

$$ {\varDelta}_T=\frac{Acc_N-{Acc}_1}{Acc_1} $$
(29)

where AccN is the average accuracy over the 28 stocks when the input window length is set to N, and Acc1 is the average accuracy when N = 1. The changes are shown in Table 13. Using 5-day time series data as model input improves the average accuracy of all proposed models.

Table 13 The ΔT of each model based on different input window length
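All three ratios Δs, Δp and ΔT share the same relative-change form; a one-line helper suffices, sketched below with a hypothetical baseline accuracy.

```python
# Relative accuracy change used for Δs, Δp and ΔT (Eqs. 27-29).
def rel_change(acc_new: float, acc_base: float) -> float:
    return (acc_new - acc_base) / acc_base

# e.g. ΔT for the 5-day BERT-based ALSTM (61.24%) against a hypothetical
# 1-day baseline accuracy of 58%:
print(rel_change(0.6124, 0.58))  # about +5.6%
```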

5 Conclusions and future work

Stock price prediction is an important part of formulating a low-risk, high-return investment. This study focuses on an increasingly significant aspect of financial market research, namely how to integrate investor sentiment from social media and make the model better able to learn time series information. To address the problem, we take the GuBa dataset of 28 stocks from November 18, 2016 to November 18, 2019 and predict stock price movements with SVR, RNN and ALSTM models. We propose a fine-tuned BERT sentiment classification model for sentiment analysis and, for comparison, a sentiment lexicon based on NTUSD. MAE, RMSE and accuracy are employed to evaluate predictive accuracy. Furthermore, we evaluate the improvements brought by different input window lengths. The results show that,

  1. Based on multiple information sources, the ALSTM model performs better than the SVR and the RNN under MAE, RMSE and accuracy.

  2. Based on ALSTM, using multiple information sources improves prediction accuracy compared with using either stock price data or sentiment alone.

  3. The fine-tuned BERT model achieves higher accuracy on the sentiment classification task, and the sentiment feature computed by the fine-tuned BERT model also leads to better predictive performance.

  4. Combining the 5-day features as a long sequential input for the models to learn achieves the best prediction accuracy.

There are several future avenues for this study. Social media sentiment is the only sentiment resource considered here; however, news data are also widely used in stock price prediction, as they are an important source of information about national conditions. Moreover, only historical prices, technical indicators and social media sentiment are employed in this study. Given the complex and volatile stock market environment, future prediction models could extract information from other useful sources, for example a company's financial condition as reflected in its financial statements and balance sheet, to make more comprehensive predictions. Finally, a more advanced hyper-parameter selection scheme could also be employed in future experiments.