1 Introduction

Big data analysis is used to systematically extract critical information for crucial decision making in fields such as the medical industry (Carnicero and Rojas 2019), manufacturing (Shang and You 2019) and business. Big data analysis has also been applied to the stock market. Typically, statistical analysis and machine learning (ML) methods are used to extract trading rules from trading information such as stock prices and trading volumes. Based on these rules, investment advice is provided to investors (Patel et al. 2015).

Natural language processing (NLP) for text-based stock market information extraction has received considerable attention over the past decade because it has achieved satisfactory results (Ding et al. 2016; Feuerriegel and Gordon 2018). Typically, investors refer to the processed textual information to make investment decisions. The stock market is dynamic and changes constantly. In this digital age, information about stocks is available online across numerous websites and social media platforms. Therefore, many researchers have focused on text-based data to accurately predict stock market behavior (Wu et al. 2019a).

Financial reports and online stock forums provide crucial updates about the stock market. Several studies have examined the use of online information to predict stock trends (Zhang et al. 2018b; Yadav et al. 2019). Agarwal et al. (2019) studied stock trends from 1992 to 2017 and concluded that sentiment analysis of views on stocks from online sources can help investors make investment decisions. Several studies have used sentiment analysis of messages such as stock news for stock market analysis and forecasting, predicting stock prices (Kim et al. 2018; Shah et al. 2018; Vanstone et al. 2019) and designing trading strategies (Stefan and Helmut 2014). However, these studies have focused only on the price trend of the stock: an increase in the price indicates positive sentiment and vice versa, and positive sentiment indicates the potential of the stock price to increase further (Wu et al. 2014). For example, if a stock exhibits a positive trend and limited fluctuations, investors do not sell the stock because they expect the price to increase in the future and thus hold the stock. In this case, the sentiment is at a point of high valence with low arousal, a state that indicates an uptrend and delayed trading. Valence captures the intensity of positive and negative sentiment, and arousal captures the intensity of calm and excitement (Warriner et al. 2013). The dimensional method expresses the emotional state as a multidimensional continuous value, which yields accurate intensity. Therefore, the stock dimensional valence–arousal (SDVA) task is essential for stock market analysis. An effective prediction model for the SDVA task requires accurate sentiment prediction of text. Deep learning (DL) algorithms extract complex data at a high level of abstraction, resulting in accurate text analysis. The hierarchical attention network (HAN) model can extract textual information through a hierarchical coding process; it retains contextual relevance markers, resulting in high classification performance (Gao et al. 2018).

Although stock market trends have often been analyzed, dimensional sentiment analysis has seldom been applied to stock market analysis. Therefore, we applied dimensional sentiment analysis to the Taiwan stock market to define stock sentiment status and used artificial intelligence (AI) techniques to solve the SDVA task. A modified HAN model exhibited superior prediction performance on the SDVA task and provided accurate information for investment decisions. We therefore solve stock dimensional valence–arousal prediction with our proposed hierarchical title–keyword-based attention hybrid network (HAHTKN). Our hypotheses are that the proposed HAHTKN improves prediction performance compared with baseline machine learning models and that the title–keyword-based attention mechanism reduces prediction error compared with a common attention mechanism. This paper also has some limitations. The data set was created for this research, which is the first to build stock valence–arousal indicators with real values, so no benchmark data set exists against which to evaluate the proposed model. In addition, the attention mechanism in this paper requires a major feature to attend to another feature: not only does the content of stock news provide information, but the title and keywords also carry very important information compared with the content.

The contributions of the proposed method are as follows:

1. Using a dimensional valence–arousal approach for stock sentiment analysis of traditional Chinese text regarding the Taiwan stock market. The goal is to define SDVA for data annotation; the SDVA indicators can assist investors by providing them with effective investment advice.

2. The HAHTKN model was used to fit the sentiment analysis of the SDVA task on stock messages. The model includes six sub-models: a title encoder, keyword encoder, title–keyword encoder, word-level encoder, sentence-level encoder and SDVA prediction layer. The estimation process of the proposed model is thus similar to human decision making, and its more complete extraction of text features is superior to that of the baseline models.

3. An annotated data set was derived for the dimensional valence–arousal task on stock news.

The remainder of this paper is organized as follows. Section 2 reviews the literature on stock prediction, sentiment analysis and deep learning; Sect. 3 describes the proposed HAHTKN model for trend and trading prediction of stocks; Sect. 4 presents the experimental results, including model performance comparisons and statistical tests; conclusions and directions for future work are presented in Sect. 5.

2 Related work

In this section, we review the literature on the development and application trends in stock prediction, sentiment analysis and deep learning.

2.1 Stock prediction

Stock markets play a crucial role in the economy because investors can share in the profits or losses of a company by buying and selling its stocks. To maximize investors' profit in the stock market, researchers have applied several methods to predict stock market trends. Depending on the type of data assessed, stock market prediction is basically divided into technical analysis and fundamental analysis, which are typically based on time-series stock transaction information and text-based information, respectively (Islam et al. 2018; Nti et al. 2020). Between these methods, technical analysis has always been a crucial part of stock market prediction because of its measurability and credibility, and technical indicators have been used in several studies to predict future stock prices or trends. A Xuanwu system was proposed by Zhang et al. (2018a) for predicting upward, downward and flat stock trends; the system uses a random forest model and was developed through technical analysis of 7 years of data from the Shenzhen Growth Enterprise Market in China to improve accuracy. A particle swarm optimization (PSO) algorithm was used with a neural network (NN) model to predict stock price direction and create the initial weights of the NN, which reduced the computation time of the PSO algorithm (Chiang et al. 2016). These studies built automatic stock market prediction systems that automatically generate training samples and meaningful features as technical indicators. Technical analysis thus remains an important approach to the stock prediction task.

With the development of the Internet and of NLP and ML techniques, extensive text-based information regarding the stock market can be processed automatically, which enhances the power of fundamental analysis. Many studies have verified the stock market forecasting capability of unstructured stock messages. For example, the experimental results of Ingle and Deshmukh (2017) indicated that their proposed HMM model achieves improved accuracy by extracting term frequency–inverse document frequency (TF-IDF) features from online news sources. Many studies have also used features such as news articles related to the financial market (Shi et al. 2018), assessments of political situations (Khan et al. 2019) and social media (Saumya et al. 2016) to predict stock market performance. Ding et al. (2016) proposed a model that employed a knowledge graph to extract event embeddings, and the results demonstrated that this method predicted stock market volatility accurately.

However, several researchers have used both structured data, such as technical indicators, and unstructured data, such as news articles, to forecast stock market performance (Chen et al. 2016a). Gálvez et al. (2017) applied the same classification system to multiple combinations of technical indicators and stock message board information and demonstrated that text-based information can improve the performance of the classification model. Wang et al. (2017) used the Stanford Parser to build a grammar tree that helped the authors identify the correspondence between core words and their sentiment values. However, these methods may not be robust enough and ignore the structure of the whole content. Shi et al. (2018) proposed a deep neural network model that hierarchically deconstructs news headlines, attempting to let computers learn the way people read.

In summary, these studies suggest that text-based data can be used in stock market prediction and emphasize the reliability of fundamental analysis to some extent. Several researchers have used both structured data, such as technical indicators, and unstructured data, such as news articles, to forecast stock market performance. Thus, unstructured data can be used to evaluate market reaction, and this improves the predictability of the stock market. However, sentiment analysis plays a crucial role in the use of unstructured data.

2.2 Sentiment analysis

The emotional response of the public toward the market as a whole can be understood using text-based stock messages, which helps investors predict stock trends. Therefore, analyzing public mood for stock market prediction has been a topic of considerable concern in the research community. With the emergence of social media and sentiment-tracking techniques, many researchers extract sentiment from websites such as Yahoo! Finance (Ranco et al. 2016) and social networking sites such as StockTwits (Batra and Daudpota 2018). Among social media platforms, stock trend prediction based on expressions of sentiment on Twitter has attracted considerable attention.

Many researchers classify the sentiment extracted from public emotions into categories, such as positive, neutral and negative, and believe that predominantly positive sentiment means the stock trend is bullish and vice versa. For instance, Pagolu et al. (2016) built a sentiment analyzer specifically for identifying stock market sentiment, which classified sentiment into three classes. Chen et al. (2016b) assigned microblogs labels from seven emotions (Happiness, Good, Sadness, Surprise, Fear, Disgust and Anger), selecting the emotion that appeared most often in each microblog; their results showed that only "Happiness" and "Disgust" have significant causal relationships with stock price movement. Another study, unlike these two, categorized public sentiments into positive and negative with five sub-classes (Joy, Anger, Disgust, Fear and Sadness) and also considered Bullishness, which is greater when the number of positive microblogs increases, and Agreement, which is greater when more people share the same sentiment. Li et al. (2017) transformed sentiment into discrete labels after applying the SentiWordNet 3.0 sentiment dictionary and developed a concept graph that considered the relationships between entities in tweets. These previous studies applied categorical sentiment to analyze text messages about the stock market, which is easy and intuitive to understand, but this method ignores that sentiment is not an absolute category; it may lie between neutral and positive/negative or between other categories.

Some works have measured sentiment with statistics-based methods. Oliveira et al. (2017) investigated microblogging data to forecast stock market variables such as stock returns. Wang et al. (2016) proposed a novel sentiment feature extraction technique that uses TF-IDF to find the most important words, named basic sentiwords, and assigns sentiment scores based on the number of occurrences of each word and the rise or fall ratio of related historical stocks. They took another perspective, defining the sentiment of a word not by predefining it but by inferring it from historical stock market fluctuations. Most importantly, they considered the importance of each word in the content. This concept is similar to the attention mechanism we propose: in our model, the vector features of the keywords and title are used to focus on the correspondingly important parts of the content.

However, most sentiment analyses concerning the stock market have concentrated mainly on stock trends, which limits predictive ability because stock trend prediction does not consider trading immediacy. In sentiment analysis, a dimensional representation such as the VA feature space provides a deeper interpretation of sentiment. Unlike categorical sentiment analysis, dimensional sentiment analysis aims to transform emotional states into continuous numerical values on multiple dimensions (Russell 1980). With this approach, researchers can detect fine-grained differences within the same sentiment category in texts. For example, Hasan et al. (2019) proposed a classification model to classify tweets automatically in real time along the valence–arousal dimensions. Salehan and Kim (2020) studied the relationship between valence–arousal sentiment and the influence of numerous tweets; their results show that negative sentiment with high arousal notably expands information spread in social media, whereas tweets with low-arousal negative sentiment are retweeted less frequently. The VA model has also been used in stock prediction. Dong et al. (2015) examined the relationship between social moods extracted from Sina Weibo and the Shanghai Composite Index, with the results indicating that negative sentiment with low arousal tends to induce risky decision making and behaviors. Ge et al. (2020) explored how messages on social media affect stock market trends; according to their experimental results, high-arousal emotions in social media bring greater volatility to the stock market after a market crash. The VA concept has also been widely used in research related to social network analysis (Max et al. 2020) and consumer behavior (Jaeger et al. 2019). However, these studies applied the VA model to sentiment words rather than to the sentiment of the stock market; for example, a word such as happiness has high valence with high arousal. Wu et al. (2019b) proposed a deep learning model to predict the VA sentiment of stock news for the stock market and used an attention mechanism to estimate the relationship between summary and keywords.

According to previous studies, sentiment analysis has moved toward multidimensional sentiment tasks, with valence–arousal growing in popularity over the past few years. In the field of stock market analysis, there has been very little research on multidimensional sentiment analysis using the valence–arousal model, yet this concept and definition are very useful in helping investors capture the state of a stock. Therefore, this study establishes two-dimensional sentiment relationships, covering trend direction and trading actions, based on the text of the summary, title and keywords. How different stock messages influence investors' emotions, represented in two dimensions, has remained unanswered because dimensional sentiment analysis has rarely been applied to stock prediction. Thus, to better comprehend the influence of stock messages, applying deep learning techniques to solve VA sentiment prediction is the focus of this research.

2.3 Machine learning

Machine learning has been widely adopted, and its use continues to increase. In the medical field, Rohini et al. (2020) used multiple linear regression, logistic regression and support vector machines to predict and classify features of Alzheimer's disease versus normal cognitive decline in older adults; their results show that simple machine learning models can increase accuracy in disease prediction. In the industrial field, Beninger et al. (2020) aimed to detect mind wandering and predict response time from driving patterns in a fully immersive driving simulator; they compared support vector machines, random forests and multilayer perceptrons on their data, and the results indicate that the random forest performed best. Gaurav et al. (2020) studied the classification of email spam and compared Naive Bayes, decision trees and random forests, confirming that random forests achieve high precision in this case. In the field of stock prediction, many ML models have been confirmed to improve classification accuracy. For example, a random forest model can be used to forecast medium- and long-term trends of stock prices (Basak et al. 2019). Support vector regression (SVR) has been used to predict stock prices for firms with large and small capitalization, and the results demonstrated that SVR improves prediction performance during periods of low volatility (Henrique et al. 2018). However, these studies only used past data for prediction and did not take other stock-related text messages into consideration.

To achieve better forecasting performance, text-based data have been processed with NLP techniques to capture the most important information, and deep learning techniques have been used to enhance prediction performance by adding more hidden layers. Maqsood et al. (2019) proposed a deep learning method to forecast stock prices; they performed sentiment analysis on tweets related to stocks, and the results showed that these text data improved stock price prediction. Abdi et al. (2019) used a deep learning model called RNSA, composed of an RNN and LSTM, to classify the sentiment of large-scale reviews, and it improved the classification accuracy of sentiment analysis on reviews. Song et al. (2019) proposed a CNN-LSTM framework for abstractive text summarization, and the results show that their model outperforms other approaches in both semantics and syntactic structure. In analyzing the semantics and emotions in stock news texts, reader feelings play a crucial role in the prediction results; accordingly, Liang et al. (2018) studied reader emotions and proposed a labeled topic model that addresses the sparsity problem in short texts. Other studies have focused on feature engineering to obtain effective text features and improve the prediction performance of ML approaches (Ikonomakis et al. 2005). However, earlier ML techniques mainly used the bag-of-words model, which represents text as a collection of words and ignores word order, so semantic misunderstanding during emotion analysis can easily result. These studies indicate that DL approaches that apply more layers and different NN models, such as CNNs and RNNs, to text-based data achieve a better understanding of the text, which greatly improves prediction performance.

To better understand unstructured data such as texts, deep learning models are used to perform semantic extraction tasks. By directly extracting features from the texts, model prediction can be completed without any feature engineering process. Yang et al. (2016) proposed the HAN model and used it to solve document classification tasks on six large-scale text classification data sets. The HAN model is akin to the process of human reading: it first focuses on the words of each sentence and then on the sentences of the document. The HAN model has also been applied to lyric analysis for music classification, where it was superior to other non-neural models and simple neural models (Tsaptsinos 2017). A hierarchical structure and attention mechanism have been used to model users' dynamic preferences, and experiments proved that this method outperformed traditional methods (Ying et al. 2018). Another study used the HAN model for document classification to extract abstract features (Abreu et al. 2019).

In summary, machine learning and deep learning models have achieved adequate prediction performance in many sentiment analysis tasks, but this paper addresses a two-dimensional prediction problem, which differs from a single prediction target. To our knowledge, no related work solves trend and trading prediction from stock messages in the manner of valence and arousal in sentiment analysis, yet features from stock messages provide important information that can enhance prediction performance. Therefore, we use a powerful HAN-based model to capture the sentiment states (trend and trading) of stock messages such as news, because HAN performs well on long documents and stock news contains more precise information than other sources such as social media.

3 Methodology

This paper aims to predict the SDVA of text related to the stock market; the flow of the proposed model is shown in Fig. 1. First, stock-related online messages such as stock news were collected, and the intensity of SDVA was annotated by experts. Second, data pre-processing was performed, followed by word segmentation. Third, the proposed HAHTKN model was trained on the training set, and the best model was selected according to performance on the validation set. Last, the test set was used to evaluate the performance of the proposed model.

Fig. 1 Flowchart of the SDVA task using the HAHTKN model

3.1 Stock message collection and annotation

We collected relevant stock market news articles from the Internet. The stock articles were used as information to form the text feature of each stock news item. Three annotators annotated each stock news item independently. The annotators labeled each news item with a valence value (trend intensity) according to the stock trend (upward or downward). In labeling arousal, trading intensity is the key concept: whether an investor holds a stock or trades it immediately. The valence and arousal values range from 1 to 10. For example, a valence rating of 1 indicates negative sentiment (downtrend), and a valence rating of 10 indicates positive sentiment (uptrend). Regarding arousal, a rating of 1 indicates low trading intensity (no trading activity), and a rating of 10 indicates high intensity (trade immediately). The values of valence and arousal for each stock news item were averaged over the three annotators.
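To make the labeling step concrete, the following sketch shows how three independent (valence, arousal) annotations are averaged per dimension to form the final SDVA label; the rating values are illustrative only.

```python
# A minimal sketch of the labeling step: each news item receives three
# independent (valence, arousal) ratings, and the final SDVA label is the
# per-dimension mean. The rating values below are illustrative only.
import numpy as np

# rows: annotators 1-3; columns: (valence, arousal)
ratings = np.array([
    [7.0, 5.0],   # annotator 1
    [8.0, 4.0],   # annotator 2
    [7.0, 6.0],   # annotator 3
])

valence, arousal = ratings.mean(axis=0)
print(f"SDVA label: valence={valence:.2f}, arousal={arousal:.2f}")
# SDVA label: valence=7.33, arousal=5.00
```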

3.2 Text pre-processing and splitting

Because the proposed HAHTKN is a hierarchical architecture, we segmented sentences using punctuation such as periods, commas, colons and semicolons. Then, we used the CKIP tagger to segment the words of each sentence in the stock news item. As the text of each stock news item, we used the title, summary and keywords. To obtain the weights of the best HAHTKN model, we split the data into a training set for model training, a validation set for model selection and a test set for evaluation.

Figure 2 shows the text pre-processing flow. First, the title and summary are split into sentences by punctuation, and these sentences are then segmented into words by the CKIP tagger; the keywords already exist as individual words, so no segmentation is needed. Second, the words of the training set, including the title, keywords and summary, are used to build a vocabulary list for converting words to indices. Finally, all words of the keywords, title and summary in all sets are converted into indices according to the vocabulary list.

Fig. 2 Flowchart of text pre-processing
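As an illustration of this flow, the following minimal Python sketch splits sentences, builds a vocabulary from the training set and converts words to indices. The toy Chinese tokens stand in for CKIP output, and the punctuation set is assumed from Sect. 3.2; the commented import shows the ckiptagger word-segmentation interface we assume would be used in practice.

```python
# A minimal sketch of the pre-processing flow in Fig. 2 (assumptions noted).
import re
# from ckiptagger import WS          # e.g. ws = WS("./data") after downloading models

def split_sentences(text):
    # split on periods, commas, colons and semicolons (full- and half-width)
    return [s for s in re.split(r"[。，：；.,:;]", text) if s.strip()]

def build_vocab(segmented_docs, pad="<pad>", unk="<unk>"):
    # vocabulary built from the training set only
    vocab = {pad: 0, unk: 1}
    for doc in segmented_docs:
        for sent in doc:
            for word in sent:
                vocab.setdefault(word, len(vocab))
    return vocab

def to_indices(sentences, vocab):
    # unseen words in validation/test map to <unk>
    return [[vocab.get(w, vocab["<unk>"]) for w in sent] for sent in sentences]

# toy stand-in for CKIP output on one training document (title + summary)
train_doc = [["台積電", "股價", "上漲"], ["外資", "買超"]]
vocab = build_vocab([train_doc])
print(to_indices(train_doc, vocab))   # [[2, 3, 4], [5, 6]]
```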

3.3 The HAHTKN model

We proposed the HAHTKN model to predict SDVA with respect to trading intensity and trends. In the proposed HAHTKN, multiple DL sub-models were combined as a nonlinear regression model for SDVA prediction. Figure 3 depicts the proposed architecture including title encoder, keyword encoder, title–keyword encoder, word-level encoder, sentence-level encoder and stock DVA prediction layer.

Fig. 3 Architecture of the proposed model

Algorithm 1 summarizes the whole working procedure of our proposed HAHTKN model. For each batch in each epoch, we first embed the title and use a bidirectional GRU (line 3) with an attention mechanism to obtain \( v^{{\mathrm{T}}} \). Second, we embed the keywords and apply an attention mechanism in the keyword encoder (line 4), and we encode \( v^{{\mathrm{T}}} \) and \( v^{K} \) into \( v^{TK} \) (line 5). Third, we encode the words of each sentence using a CNN, a bidirectional GRU and an attention mechanism in the word encoder (line 6) to obtain the sentence vector \( s_{i} \). Fourth, in the sentence encoder (line 7), we use a bidirectional GRU and the Luong attention mechanism to obtain the text vector \( v \). Next, we use a linear model to calculate \( \hat{y} \) (line 8) and the mean square error (MSE) loss to measure the squared error between the predicted value \( \hat{y} \) and the target value \( y \) (line 9). Finally, we use the Adam optimizer to optimize all trainable parameters (line 10).

Algorithm 1 The working procedure of the HAHTKN model
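To make the data flow of Algorithm 1 concrete, the following schematic uses numpy stubs in place of the trained sub-models; the dimension d and the sample inputs are illustrative only, and the real encoders are defined in Sects. 3.3.1 to 3.3.6.

```python
# A schematic of one forward pass of Algorithm 1 with stub encoders.
import numpy as np

d = 100
rng = np.random.default_rng(0)
enc = lambda *args: rng.standard_normal(d)        # stub standing in for any encoder

title, keywords = "鋼鐵股看漲", ["中鋼", "獲利"]
sentences = [["中鋼", "去年", "獲利", "成長"], ["現金", "殖利率", "約", "4%"]]

v_T  = enc(title)                                 # line 3: title encoder (BiGRU + attention)
v_K  = enc(keywords)                              # line 4: keyword encoder (attention)
v_TK = enc(v_T, v_K)                              # line 5: title-keyword encoder
s    = [enc(sent) for sent in sentences]          # line 6: word-level encoder per sentence
v    = enc(s, v_TK)                               # line 7: sentence encoder (Luong attention on v_TK)
W_VA, b_VA = rng.standard_normal((2, d)), np.zeros(2)
y_hat = W_VA @ v + b_VA                           # line 8: SDVA prediction layer
print(y_hat.shape)                                # (2,) -> (valence, arousal)
```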

The proposed HAHTKN model comprised six sub-models; the detailed processing is as follows.

3.3.1 Title encoder

Given a title with segmented words \( w_{l}^{{\mathrm{T}}} , l \in \left[ {1,L} \right] \), we obtained each word embedding \( x_{l}^{{\mathrm{T}}} \) from the embedding matrix \( W_{{\mathrm{T}}} \). We used a bidirectional gated recurrent unit (GRU) to obtain hidden states of the word embeddings from both directions, that is, \( h_{l}^{{\mathrm{T}}} = \left[ {\vec{h}_{l}^{{\mathrm{T}}} , \overleftarrow{h}_{l}^{{\mathrm{T}}} } \right] \). A bidirectional GRU contains a forward \( \overrightarrow{\mathrm{GRU}} \), which reads the title from \( x_{1}^{{\mathrm{T}}} \) to \( x_{L}^{{\mathrm{T}}} \), and a backward \( \overleftarrow{\mathrm{GRU}} \), which reads the title from \( x_{L}^{{\mathrm{T}}} \) to \( x_{1}^{{\mathrm{T}}} \):

$$ x_{l}^{{\mathrm{T}}} = W_{{\mathrm{T}}} \varphi \left( {w_{l}^{{\mathrm{T}}} } \right), \quad l \in \left[ {1,L} \right] $$
(1)
$$ \vec{h}_{l}^{{\mathrm{T}}} = \overrightarrow{\mathrm{GRU}} \left( {x_{l}^{{\mathrm{T}}} } \right), \quad l \in \left[ {1,L} \right] $$
(2)
$$ \overleftarrow{h}_{l}^{{\mathrm{T}}} = \overleftarrow{\mathrm{GRU}} \left( {x_{l}^{{\mathrm{T}}} } \right), \quad l \in \left[ {L,1} \right] $$
(3)

where \( \varphi \) is a one-hot operation, l indexes the words of the title, and \( W_{{\mathrm{T}}} \) denotes trainable parameters in the HAHTKN. Standard GRU models were used for \( \overrightarrow{\mathrm{GRU}} \) and \( \overleftarrow{\mathrm{GRU}} \). However, not all words of a title contribute equally to its meaning. Therefore, we used a simple attention mechanism with a title-level context vector \( u_{{\mathrm{T}}} \) to estimate the importance value of each word and combined those informative words into a title vector \( v^{{\mathrm{T}}} \) that summarizes the information of the words in the title of the stock article. The title-level representation \( v^{{\mathrm{T}}} \) is expressed as follows:

$$ u_{l}^{{\mathrm{T}}} = { \tanh }\left( {W_{{T^{\prime}}} h_{l}^{{\mathrm{T}}} + b_{{\mathrm{T}}} } \right) $$
(4)
$$ \alpha_{l}^{{\mathrm{T}}} = \frac{{\exp\left( {\left( {u_{l}^{{\mathrm{T}}} } \right)^{{ \intercal }} u_{{\mathrm{T}}} } \right)}}{{\mathop \sum \nolimits_{v} \exp\left( {\left( {u_{v}^{{\mathrm{T}}} } \right)^{{ \intercal }} u_{{\mathrm{T}}} } \right)}} $$
(5)
$$ v^{{\mathrm{T}}} = \mathop \sum \limits_{l} \alpha_{l}^{{\mathrm{T}}} u_{l}^{{\mathrm{T}}} $$
(6)

where \( W_{{T^{\prime}}} \), \( b_{{\mathrm{T}}} \) and \( u_{{\mathrm{T}}} \) are trainable parameters in the HAHTKN. The model uses \( u_{l}^{{\mathrm{T}}} \) to measure the importance of a word and acquire a normalized importance weight \( \alpha_{l}^{{\mathrm{T}}} \); l and v both index the words of the title. The model computes the title vector \( v^{{\mathrm{T}}} \) as a weighted sum of the title-level representations based on these weights.
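A minimal PyTorch sketch of the title encoder follows; the vocabulary, embedding and hidden sizes are illustrative assumptions, and the attention follows Eqs. (4)-(6) as written, summing the weighted representations \( u_{l}^{{\mathrm{T}}} \).

```python
# A sketch of the title encoder (Eqs. 1-6); sizes are illustrative.
import torch
import torch.nn as nn

class TitleEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)           # Eq. (1)
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)                  # Eqs. (2)-(3)
        self.proj = nn.Linear(2 * hid_dim, 2 * hid_dim)        # W_T', b_T in Eq. (4)
        self.u_T = nn.Parameter(torch.randn(2 * hid_dim))      # context vector in Eq. (5)

    def forward(self, title_ids):                 # (batch, L)
        h, _ = self.bigru(self.emb(title_ids))    # (batch, L, 2*hid)
        u = torch.tanh(self.proj(h))              # Eq. (4)
        alpha = torch.softmax(u @ self.u_T, dim=1)             # Eq. (5)
        return (alpha.unsqueeze(-1) * u).sum(dim=1)            # Eq. (6): title vector v_T

enc = TitleEncoder(vocab_size=5000)
print(enc(torch.randint(0, 5000, (4, 26))).shape)  # torch.Size([4, 200])
```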

3.3.2 Keyword encoder

Given a stock message with keywords \( w_{m}^{K} , m \in \left[ {1,Q} \right] \), where \( Q \) denotes the maximum number of keywords in a stock news item, we obtained each keyword embedding from the embedding matrix \( W_{K} \). To measure the varying importance of each keyword, we used a simple attention mechanism to evaluate the importance of the keywords in each stock news item and aggregated the representations of those informative keywords into a keyword vector as follows:

$$ x_{m}^{K} = W_{K} \varphi \left( {w_{m}^{K} } \right), m \in \left[ {1,Q} \right] $$
(7)
$$ u_{m}^{K} = { \tanh }\left( {W_{{K^{\prime}}} x_{m}^{K} + b_{K} } \right) $$
(8)
$$ \alpha_{m}^{K} = \frac{{\exp\left( {\left( {u_{m}^{K} } \right)^{{ \intercal }} u_{K} } \right)}}{{\mathop \sum \nolimits_{r} \exp\left( {\left( {u_{r}^{K} } \right)^{{ \intercal }} u_{K} } \right)}} $$
(9)
$$ v^{K} = \mathop \sum \limits_{m} \alpha_{m}^{K} x_{m}^{K} $$
(10)

where \( W_{K} \), \( W_{{K^{\prime}}} \), \( b_{K} \) and \( u_{K} \) are trainable parameters in the HAHTKN. We used \( u_{m}^{K} \) to measure the importance of the keywords and acquired a normalized importance weight \( \alpha_{m}^{K} \); m and r both index the keywords of the stock news item. The keyword representation is obtained by summing the weighted keyword-level vectors.
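A corresponding PyTorch sketch of the keyword encoder is given below; unlike the title encoder, the keyword embeddings are attended directly without a recurrent layer, and Eq. (10) sums the weighted embeddings themselves. Sizes are illustrative.

```python
# A sketch of the keyword encoder (Eqs. 7-10); sizes are illustrative.
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # Eq. (7)
        self.proj = nn.Linear(emb_dim, emb_dim)           # W_K', b_K in Eq. (8)
        self.u_K = nn.Parameter(torch.randn(emb_dim))     # context vector in Eq. (9)

    def forward(self, kw_ids):                    # (batch, Q)
        x = self.emb(kw_ids)                      # (batch, Q, emb)
        u = torch.tanh(self.proj(x))              # Eq. (8)
        alpha = torch.softmax(u @ self.u_K, dim=1)        # Eq. (9)
        return (alpha.unsqueeze(-1) * x).sum(dim=1)       # Eq. (10): keyword vector v_K

enc = KeywordEncoder(vocab_size=5000)
print(enc(torch.randint(0, 5000, (4, 10))).shape)  # torch.Size([4, 100])
```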

3.3.3 Title–keyword attention

Given the title vector \( v^{{\mathrm{T}}} \) and keyword vector \( v^{K} \), the two are concatenated and passed through a simple attention mechanism to obtain a title–keyword context vector \( u^{TK} \), which is then used to estimate the importance of the title and keyword representations. These informative representations are combined into a title–keyword-based vector \( v^{TK} \) that summarizes all the information of the title and keywords in a stock message as follows:

$$ u^{TK} = \tanh \left( {w^{TK} \left[ {v^{{\mathrm{T}}} ,v^{K} } \right]} \right) $$
(11)
$$ \alpha_{o}^{TK} = \frac{{\exp \left( {\left( {u_{o}^{TK} } \right)^{{ \intercal }} u_{TK} } \right)}}{{\mathop \sum \nolimits_{p} \exp \left( {\left( {u_{p}^{TK} } \right)^{{ \intercal }} u_{TK} } \right)}} $$
(12)
$$ v^{TK} = \mathop \sum \limits_{o} \alpha_{o}^{TK} u_{o}^{TK} $$
(13)

where \( w^{TK} \) and \( u_{TK} \) are trainable parameters in the HAHTKN.

3.3.4 Word-level encoder

Given the ith sentence with words \( w_{iu} ,u \in \left[ {1,N} \right] \), where N is the maximum number of words in a sentence, we first obtained each word embedding from the embedding matrix \( W_{{\mathrm{E}}} \). Then, we applied standard convolutional neural network (CNN) filters to produce a feature map from the word embeddings. This paper designs multiple filters with a max-pooling operation to form a CNN feature extractor, and the features of each filter are concatenated into hidden features \( x^{\prime}_{ij} \). After CNN extraction, we also used a bidirectional GRU to obtain hidden states from both directions, that is, \( h_{ij} = \left[ {\vec{h}_{ij} , \overleftarrow{h}_{ij} } \right] \). The bidirectional GRU contains a forward \( \overrightarrow{\mathrm{GRU}} \), which reads the sentence \( s_{i} \) from \( x^{\prime}_{i1} \) to \( x^{\prime}_{iP} \), and a backward \( \overleftarrow{\mathrm{GRU}} \), which reads from \( x^{\prime}_{iP} \) to \( x^{\prime}_{i1} \):

$$ x_{iu} = W_{{\mathrm{E}}} \varphi \left( {w_{iu} } \right), \quad u \in \left[ {1,N} \right] $$
(14)
$$ \left[ {x^{\prime}_{i1} , \ldots ,x^{\prime}_{ij} , \ldots ,x^{\prime}_{iP} } \right] = \mathrm{CNN}\left( {\left[ {x_{i1} , \ldots ,x_{iu} , \ldots ,x_{iN} } \right]} \right) $$
(15)
$$ \vec{h}_{ij} = \overrightarrow{\mathrm{GRU}} \left( {x^{\prime}_{ij} } \right), \quad j \in \left[ {1,P} \right] $$
(16)
$$ \overleftarrow{h}_{ij} = \overleftarrow{\mathrm{GRU}} \left( {x^{\prime}_{ij} } \right), \quad j \in \left[ {P,1} \right] $$
(17)

where \( W_{{\mathrm{E}}} \) denotes the trainable parameters in the HAHTKN. To weigh words that are relevant, we used a simple attention mechanism with a word-level context vector \( u_{{\mathrm{W}}} \) to estimate the importance value of each word and combined those informative words into a sentence vector \( s_{i} \) that summarizes all the information of words in a sentence. A sentence-level representation \( s_{i} \) is expressed as follows:

$$ u_{ij} = { \tanh }\left( {W_{{\mathrm{S}}} h_{ij} + b_{{\mathrm{S}}} } \right) $$
(18)
$$ \alpha_{ij} = \frac{{\exp \left( {u_{ij}^{{ \intercal }} u_{{\mathrm{W}}} } \right)}}{{\mathop \sum \nolimits_{q} \exp \left( {u_{iq}^{{ \intercal }} u_{{\mathrm{W}}} } \right)}} $$
(19)
$$ s_{i} = \mathop \sum \limits_{j} \alpha_{ij} h_{ij} $$
(20)

where \( W_{{\mathrm{S}}} \), \( b_{{\mathrm{S}}} \) and \( u_{{\mathrm{W}}} \) are trainable parameters in the HAHTKN. The model uses \( u_{{\mathrm{W}}} \) to measure the importance of a word and acquire a normalized importance weight \( \alpha_{ij} \); j and q both index the words of the sentence. The weighted word-level hidden states are summed to form the sentence vector \( s_{i} \).
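The following PyTorch sketch illustrates the word-level encoder. The filter sizes [2, 3, 4] follow Sect. 4.4, but the filter count and hidden size are illustrative, and the paper's max-pooling step is simplified here to same-padded convolutions so that one concatenated CNN feature per word position feeds the bidirectional GRU.

```python
# A simplified sketch of the word-level encoder (Eqs. 14-20).
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_filters=32, hid_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # Eq. (14)
        self.convs = nn.ModuleList([                               # Eq. (15)
            nn.Conv1d(emb_dim, n_filters, k, padding="same") for k in (2, 3, 4)])
        self.bigru = nn.GRU(3 * n_filters, hid_dim,
                            bidirectional=True, batch_first=True)  # Eqs. (16)-(17)
        self.proj = nn.Linear(2 * hid_dim, 2 * hid_dim)            # Eq. (18)
        self.u_W = nn.Parameter(torch.randn(2 * hid_dim))          # Eq. (19)

    def forward(self, word_ids):                  # (batch, N) for one sentence
        x = self.emb(word_ids).transpose(1, 2)    # (batch, emb, N) for Conv1d
        feats = torch.cat([c(x) for c in self.convs], dim=1)       # (batch, 3*filters, N)
        h, _ = self.bigru(feats.transpose(1, 2))  # (batch, N, 2*hid)
        u = torch.tanh(self.proj(h))              # Eq. (18)
        alpha = torch.softmax(u @ self.u_W, dim=1)                 # Eq. (19)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)                # Eq. (20): sentence vector

enc = WordEncoder(vocab_size=5000)
print(enc(torch.randint(0, 5000, (4, 44))).shape)  # torch.Size([4, 200])
```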

3.3.5 Sentence-level encoder

Given the sentence vectors \( s_{i} \), sentence representations are obtained by forward and backward bidirectional GRUs that encode the sentences as follows:

$$ \vec{h}_{i} = \overrightarrow{\mathrm{GRU}} \left( {s_{i} } \right),\quad i \in \left[ {1,M} \right] $$
(21)
$$ \overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}} \left( {s_{i} } \right),\quad i \in \left[ {M,1} \right] $$
(22)

Here, \( \vec{h}_{i} \) and \( \overleftarrow{h}_{i} \) are concatenated into a sentence hidden state, that is, \( h_{i} = \left[ {\vec{h}_{i} , \overleftarrow{h}_{i} } \right] \). Then, the Luong attention mechanism is used to estimate the importance of each sentence hidden state \( h_{i} \) based on the title–keyword-based vector \( v^{TK} \) and to aggregate the representations of those informative sentences into a text vector \( v \):

$$ u_{i} = \left\{ \begin{array}{ll} h_{i}^{{ \intercal }} v^{TK} & \text{dot} \\ h_{i}^{{ \intercal }} W_{{\mathrm{b}}} v^{TK} & \text{general} \\ v_{{\mathrm{b}}}^{{ \intercal }} \tanh \left( {W_{{\mathrm{b}}} \left[ {h_{i} ,v^{TK} } \right]} \right) & \text{concat} \end{array} \right. $$
(23)
$$ \alpha_{i} = \frac{{\exp\left( {u_{i} } \right)}}{{\mathop \sum \nolimits_{j} \exp\left( {u_{j} } \right)}} $$
(24)
$$ v = \mathop \sum \limits_{i} \alpha_{i} h_{i} $$
(25)

where \( W_{{\mathrm{b}}} \) and \( v_{{\mathrm{b}}} \) denote trainable parameters in the HAHTKN. The model uses \( u_{i} \) to measure the importance of a sentence and acquire a normalized importance weight \( \alpha_{i} \); the weighted sentence-level hidden states are then summed. The concat operation estimates the attention weight by concatenating the sentence and title–keyword vectors, the general operation uses the title–keyword vector directly with a weight matrix, and the dot operation attaches no model weight when estimating the attention weight.
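A minimal PyTorch sketch of the three Luong score functions in Eq. (23) follows, scoring each sentence hidden state \( h_{i} \) against the title–keyword vector \( v^{TK} \); sizes are illustrative.

```python
# A sketch of the Luong attention variants (Eqs. 23-25); sizes are illustrative.
import torch
import torch.nn as nn

class LuongAttention(nn.Module):
    def __init__(self, dim, mode="concat"):
        super().__init__()
        self.mode = mode
        self.W_b = nn.Linear(dim if mode == "general" else 2 * dim, dim, bias=False)
        self.v_b = nn.Parameter(torch.randn(dim))

    def score(self, h, v_tk):                     # h: (batch, M, dim), v_tk: (batch, dim)
        if self.mode == "dot":                    # h_i^T v_TK
            return (h * v_tk.unsqueeze(1)).sum(-1)
        if self.mode == "general":                # h_i^T W_b v_TK
            return (h * self.W_b(v_tk).unsqueeze(1)).sum(-1)
        cat = torch.cat([h, v_tk.unsqueeze(1).expand_as(h)], dim=-1)
        return torch.tanh(self.W_b(cat)) @ self.v_b        # concat

    def forward(self, h, v_tk):
        alpha = torch.softmax(self.score(h, v_tk), dim=1)  # Eq. (24)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)        # Eq. (25): text vector v

att = LuongAttention(dim=200, mode="concat")
h, v_tk = torch.randn(4, 12, 200), torch.randn(4, 200)
print(att(h, v_tk).shape)   # torch.Size([4, 200])
```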

3.3.6 Stock DVA prediction layer

The text vector \( v \) is a high-level representation of the message, and it can be used to predict stock valence and arousal \( \hat{y} \in {\mathbb{R}}^{2} \):

$$ \hat{y} = W_{VA} v + b_{VA} $$
(26)

where \( W_{VA} \) and \( b_{VA} \) are trainable parameters in the HAHTKN.

3.4 Model training and optimization

The mean square error (MSE) is used to calculate the loss between the predicted SDVA and the target SDVA. The training loss function is defined as follows:

$$ L\left( \theta \right) = \frac{1}{L}\mathop \sum \limits_{l} \left( {\hat{y}_{l} - y_{l} } \right)^{2} $$
(27)

where \( y_{l} \) is the target SDVA value of the lth stock message. The parameters \( \theta \) are jointly trained until convergence using the Adam optimizer.
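A minimal training-step sketch for Eq. (27) is shown below; the linear stand-in for the full HAHTKN and the random batch are illustrative only.

```python
# A sketch of the training loop: joint MSE over the two SDVA dimensions,
# optimized with Adam (Eq. 27). All tensors here are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(200, 2)                     # stand-in for the full HAHTKN
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()                      # Eq. (27)

features = torch.randn(16, 200)               # stand-in text vectors v
targets = torch.rand(16, 2) * 9 + 1           # (valence, arousal) labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
```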

4 Experiments

We collected 3588 stock market messages posted from January to April 2019 on the Anue website (https://www.cnyes.com/). Our data contained the title, summary and keywords of each stock message. Valence, as trend intensity, and arousal, as trading intensity, were independently annotated by three experts, with values ranging between 1 and 9. Deviating values were removed, and the valence and arousal annotations of the three annotators were averaged to obtain the final SDVA values. This data set was collected specifically for SDVA sentiment analysis and is the first of its kind among stock sentiment analyses.

Figure 4 illustrates the valence and arousal distributions of the collected data. Both low and high valence typically co-occur with high arousal, which reveals that the stock price is expected to increase or decrease sharply in the future; an investor should therefore trade immediately. Furthermore, middle and low arousal typically co-occur with middle valence, because such messages do not reveal a drastic increase or decrease in the stock price; an investor is expected to hold their stocks.

Fig. 4 Valence and arousal score distributions

Table 1 presents the statistics of the collected data set. A total of 3588 documents along with the annotated SDVA were divided into data for training (73.2% of the data), validation (13.9%) and test (12.9%). The average valence was approximately 5.73, and the average arousal approximately 4.06; thus, the arousal in our data set had a downward bias. The annotators may have been more conservative about transactions, and most stock news items did not motivate immediate trading.

Table 1 Data set statistics

4.1 Evaluation metrics

We compared the SDVA values labeled by each annotator against their corresponding means across the three annotators to calculate the error rates using the following metrics:

  • Mean absolute error (MAE):

    $$ {\mathrm{MAE}} = \frac{1}{L}\mathop \sum \limits_{l = 1}^{L} \left| {A_{l} - \bar{A}_{l} } \right| $$
    (28)
  • Root-mean-square error (RMSE):

    $$ {\mathrm{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{l = 1}^{L} \left( {A_{l} - \bar{A}_{l} } \right)^{2} }}{L}} $$
    (29)

    where \( A_{l} \) denotes the valence or arousal value of the lth message rated by an annotator, \( \bar{A}_{l} \) denotes the mean valence or arousal of the lth message calculated over the three annotators, and L is the total number of messages in our collected data set.

For the model evaluation, the two metrics of MSE and median absolute error (MDAE) were used to evaluate each model's performance with respect to stock valence and arousal. The MDAE formula is defined as follows:

$$ {\mathrm{MDAE}} = {\mathrm{median}}\left( {\left| {\hat{y}_{1} - y_{1} } \right|, \ldots ,\left| {\hat{y}_{L} - y_{L} } \right|} \right) $$
(30)
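The three metrics can be computed directly, as in the following numpy sketch with illustrative values; the same MAE and RMSE forms apply both to annotator agreement (against the three-annotator means) and to model evaluation (against the targets).

```python
# A sketch of Eqs. (28)-(30); the ratings below are illustrative only.
import numpy as np

a = np.array([7.0, 3.0, 5.0, 8.0])       # one annotator's valence ratings
a_bar = np.array([7.3, 3.7, 5.0, 7.7])   # means over the three annotators

mae = np.mean(np.abs(a - a_bar))          # Eq. (28)
rmse = np.sqrt(np.mean((a - a_bar) ** 2)) # Eq. (29)
mdae = np.median(np.abs(a - a_bar))       # Eq. (30)
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, MDAE={mdae:.3f}")
```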

4.2 Human annotation results of valence and arousal with respect to stock news

Table 2 depicts the error rates of the annotators in rating the valence and arousal of the news in the data set. For all metrics, the mean error rates of the arousal ratings were higher than those of the valence ratings. In the valence dimension, the MAE was approximately 0.457–0.610 and the RMSE approximately 0.633–0.838; in the arousal dimension, the MAE was approximately 0.943–1.221 and the RMSE approximately 1.198–1.479, reflecting the largest gap among all error rates. The annotators' attitudes toward valence were thus more consistent than toward arousal. We conclude that the two SDVA dimensions applied to the stock market are consistent with other VA models used in sentiment analysis.

Table 2 Annotation evaluation on annotators

4.3 Prediction models for SDVA task

In the experiments, the HAN model was used as a reference because it has been proven effective in understanding the meaning of text. We compared eight HAN-based neural network models and two machine learning models; these are detailed as follows:

1. Baseline models:

    • RF: a Random Forest Regression model, which uses the TF-IDF approach to transform all words of the document into a vector.

• SVR: a Support Vector Regression model with a linear kernel, which uses the TF-IDF approach to transform all words of the document into a vector.

• HAN: the baseline model (Yang et al. 2016).

2. Advanced models:

    • HAHN: a CNN is added to the HAN model. This CNN operates on the first layer of the NN.

    • HAKN: the proposed model, which uses Luong attention mechanism in the sentence layer with keyword-based attention.

    • HAHKN: a CNN is added to the HAKN model, which operates on the first layer of the NN.

    • HATN: the proposed model, which uses Luong attention mechanism in the sentence layer with title-based attention.

    • HAHTN: a CNN is added to the HATN model, which operates on the first layer of the NN.

    • HATKN: the proposed model, which uses Luong attention mechanism in the sentence layer with title-based attention and keyword-based attention.

    • HAHTKN: a CNN is added to the HATKN model, which operates on the first layer of the NN.

4.4 Hyper-parameters setup

Our proposed deep learning model has several hyper-parameters: the hidden sizes for all sub-models are 100, 300 and 500; the learning rates are 0.001 and 0.01; the filter sizes of the multi-channel CNN are [2, 3, 4]; and an LSTM cell is also evaluated as an alternative RNN cell. The maximum numbers of words for the title encoder and the word-level encoder are 26 and 44, respectively; the number of sentences for the sentence encoder is 12; and all words are used in the keyword encoder. For RF regression, the criterion is MSE, the numbers of estimators are [25, 50, 100] and the minimum sample splits are [2, 10]. For the SVR model, the kernels are [linear, poly, rbf, sigmoid], cost and gamma are 1 and the degrees are [3, 5]. Note that gamma applies to the [poly, rbf, sigmoid] kernels and degree only to the poly kernel.
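These grids can be expressed directly in Python and enumerated with itertools.product, as sketched below; the training call is a placeholder.

```python
# A sketch of the search grids in Sect. 4.4; train_and_validate is a placeholder.
from itertools import product

dl_grid = {"hidden_size": [100, 300, 500],
           "learning_rate": [0.001, 0.01],
           "cell": ["GRU", "LSTM"],
           "cnn_filter_sizes": [[2, 3, 4]]}

rf_grid = {"criterion": ["mse"], "n_estimators": [25, 50, 100],
           "min_samples_split": [2, 10]}

for hid, lr, cell, filters in product(*dl_grid.values()):
    pass  # train_and_validate(hid, lr, cell, filters)
```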

4.5 Performance comparisons regarding different estimations of Luong attention

In this experiment, we used Luong attention to focus on the sentences of the stock market news to predict SDVA scores, with the attention score calculated by the concat, dot or general method. Table 3 compares the effectiveness of each model under each method; for each model, the method with the minimum validation loss is reported. Only the HAKN achieved its best result with dot, with a test loss of 0.864; the other models performed better with the concat operation. HAHTKN was the best model, obtaining a test loss of 0.772. Furthermore, for trend, the model obtained an MSE of 0.718 and MDAE of 0.452 on the test data; for trading, it obtained an MSE of 0.826 and MDAE of 0.592. The results show that concat maintains more information, and this original information provides a better estimate.

Table 3 Prediction performances with respect to different estimation methods

4.6 Performance comparisons for different hidden sizes

We tested hidden sizes of 100, 300 and 500. With a hidden size of 500, HAN, HAKN, HAHTN and HATKN performed best, with test losses of 0.834, 0.864, 0.832 and 0.841, respectively. HAHN, HAHKN, HATN and HAHTKN performed best with a hidden size of 300, with test losses of 0.822, 0.831, 0.802 and 0.772, respectively. All results are shown in Table 4.

Table 4 Prediction performances with respect to different hidden sizes

4.7 Performance comparisons for different recurrent neural network (RNN) cells

Table 5 presents the performance of the models with different RNN cells. Two cells were used in all experimental models, namely the GRU and long short-term memory (LSTM). HAHTN and HATKN achieved their smallest test losses with the LSTM, at 0.832 and 0.841, respectively, whereas all other models exhibited superior results with the GRU. The two models that were better with the LSTM cell reduced their average test loss by 0.033 compared with using the GRU cell; by contrast, the models that were superior with the GRU reduced their average test loss by 0.056 compared with using the LSTM cell. Thus, the GRU was more effective overall than the LSTM.

Table 5 Prediction performances with respect to different RNN cells

4.8 Performance comparisons for different learning rates

Table 6 presents the results for different learning rates. Overall, with a learning rate of 0.001, 100% and 87.5% of the models had superior MSE and MDAE, respectively, for trend intensity (valence); for trading intensity (arousal), 87.5% and 100% of the models had better MSE and MDAE, respectively. However, some models performed better with a learning rate of 0.01; for example, HAN had a better MSE for trading intensity, and HAHN had a better MDAE for trend intensity.

Table 6 Prediction performances with respect to different learning rates

4.9 Performance comparisons for CNN

The overall SDVA prediction performances of the models are presented in Table 7. Without the CNN, the HATN exhibited the best performance, with a test loss of 0.802: for trend, an MSE of 0.660 and MDAE of 0.426 on the test data; for trading, an MSE of 0.943 and MDAE of 0.696. With the CNN, the HAHTKN exhibited the best performance, with a test loss of 0.772, an MSE of 0.718 and MDAE of 0.452 for trend, and an MSE of 0.826 and MDAE of 0.592 for trading. Only the HAHTN exhibited worse prediction results after adding CNN-extracted features; the likely cause is an exaggerated title or inconsistency between the title and the content. However, being misled by a biased title is a mistake made not only by machines but also by humans. If both title and keyword information are used, the model can balance their weights to achieve better predictions, and the results of this experiment concur with this observation: the HAHTKN achieved this performance. With the information of the title and keywords, the neural network pays more attention to the important sentences in the summary content, that is, the weights of these sentences are higher than those of the others. Moreover, among the models using the CNN filter, three experiments achieved the best performance compared with their counterparts without the CNN filter.

Table 7 Comparison of prediction performances with and without the CNN filter

4.10 Overall performance comparison and statistical significance tests

The overall SDVA prediction performances of these models are presented in Fig. 5. We compared the three baseline models with our proposed HAHTKN. The two machine learning models were RF and SVR. According to the experimental results, the best hyper-parameter setting for RF regression was 100 estimators with a minimum sample split of 2, and the best setting for SVR was the linear kernel. For trend prediction, RF had an MSE of 0.786 and MDAE of 0.387, and SVR had an MSE of 0.699 and MDAE of 0.464; for trading prediction, RF had an MSE of 1.092 and MDAE of 0.708, and SVR had an MSE of 1.029 and MDAE of 0.712. The HAHTKN outperformed the two machine learning models as well as the HAN baseline, whose test loss (overall MSE) was 0.834; overall, HAHTKN was the best model in our experiment.

Fig. 5 Overall prediction performances of four models on the test set

We also performed two statistical tests to verify that the losses of HAHTKN on the test set are significant improvements over the other three baseline models. Many studies have proved that their models can effectively predict trends of the stock market, but they lack discussion of stock market trading; therefore, we used trading intensity for the statistical tests. Two statistical significance tests, the F test and the t test, were applied to the losses of the proposed HAHTKN and the other three models. The F test checks whether there is a significant difference in the variance of two populations, and the t test verifies whether the difference between the means of two populations is significant. We set the F test as a one-tailed test and the t test as a two-tailed test, with an alpha of 0.05 for both. The statistical results are shown in Table 8. For trading intensity, the variance of HAHTKN differs significantly from that of RF, because the p value of the F test is under 0.05, and all p values of the t test are under 0.05, indicating that the loss of HAHTKN is significantly smaller than that of the other models.
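The following scipy sketch illustrates the two tests under the assumption that each model yields one loss per test message; the per-item losses here are randomly generated placeholders, and the paired t test reflects that both models score the same test items.

```python
# A sketch of the one-tailed F test and two-tailed t test on per-item losses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
loss_hahtkn = rng.gamma(2.0, 0.4, size=463)   # placeholder per-item losses (HAHTKN)
loss_rf = rng.gamma(2.0, 0.55, size=463)      # placeholder per-item losses (RF)

# One-tailed F test on variances: F = s1^2 / s2^2
f_stat = np.var(loss_rf, ddof=1) / np.var(loss_hahtkn, ddof=1)
p_f = stats.f.sf(f_stat, len(loss_rf) - 1, len(loss_hahtkn) - 1)

# Two-tailed paired t test on means
t_stat, p_t = stats.ttest_rel(loss_hahtkn, loss_rf)
print(f"F={f_stat:.3f} (p={p_f:.4f}), t={t_stat:.3f} (p={p_t:.4f})")
```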

Table 8 Results of significance tests on compared models in trading intensity

4.11 Discussion

To illustrate the attention weights of these prediction models, we present a practical stock news item as an example of their differences. An example from the validation data set is presented in Table 9. According to the table, Sinosteel made a good profit last year, and the cash yield was not bad, so investors may believe that the current stock trend is upward. However, because the content does not mention the company's current and future goals and growth, the trading intensity is medium. The average values of the three annotations for trend and trading are 7.0 and 5.0.

Table 9 An SDVA example of stock news

In Table 10, we compare the errors in trend and trading among six models. The overall error of the HAHTKN was only 0.405 (0.318 + 0.087), the smallest of all models. According to the results in Table 11, the HAHTKN focused on Sentence 1, which resulted in better prediction of trading intensity; Sentence 1 clearly revealed the profit growth of the company, which was closer to the investor's perspective on trading. By contrast, the HAHTKN did not exhibit the best trend intensity, because the HATKN examines each sentence and gave Sentence 3 a higher weight; Sentence 3 explained the amount that will be allocated per share, which is also crucial information for investors in judging the stock trend.

Table 10 Prediction performances of each model in this example

The HATN and HAHKN focused on each sentence with different weights. The HAHTN focused on the most relevant title sentence, "the cash yield is about 4%", because it added the CNN to extract the feature vector. The HAHTKN focused on "Sinosteel (2002-TW) net profit after tax increased by more than 40% last year." The sentence that the HAHTKN followed was more specific than that of the HAHTN, conveying the impression that the overall stock performance was better than last year. This conclusion is similar to that of humans reading the articles: specific and obvious growth draws attention and promotes the desire to purchase. Therefore, the HAHTKN exhibited the best performance in this experiment because it closely reflected human thinking.

Table 11 Attention weights of each model for each sentence

In the comparison of Luong attention's calculation methods, 66.7% of the models exhibited good results with concat, and the other 33.3% exhibited better results with the dot operation. In the concat method, the sentence vector and the estimated vector of the keywords and title are calculated together, so complete information is available during prediction. Between hidden sizes of 300 and 500, the results were superior for 300. Nearly all models had better results with a learning rate of 0.001. Among the RNN cells, 75% of the models exhibited better results with the GRU and 25% with the LSTM.

5 Conclusion

We proposed a model for predicting SDVA sentiment in the stock market. According to all experimental results, the prediction performance of our proposed HAHTKN model outperformed the other HAN-based and machine learning baseline models. We used the title, keywords and summary of stock market–related messages to estimate the corresponding vectors and then calculated the text vector in the sentence encoder using the Luong attention mechanism; the logic of the sentence encoder is similar to human reading. In addition to identifying positive and negative stock market trends, the motivation for buying and selling is more important; the proposed HAHTKN model is expected to generate profit for investors because its judgment resembles that of human beings judging the stock market. The experiments also proved that CNN feature extraction can effectively improve the model when applied to our data set. Our proposed model uses word embeddings to transform texts into vectors but cannot recognize words with multiple meanings. Future work will focus on other pre-trained word embedding models such as ELMo and BERT to improve the embeddings of the keyword, title and word encoders and will include additional sentiment corpora, such as sentiment words with valence and arousal values, to allow the system to capture more sentiment information.