1 Introduction

With the rapid development of the market economy, the financial market at its core has also made great breakthroughs. On February 23, 2019, the Chinese President emphasized that “preventing and defusing financial risks, especially systemic financial risks, is the fundamental task of financial work.” It can be seen that financial risk has always been one of the cores of the development strategy [1]. In fact, other countries around the world also attach importance to the development of the financial market. In order to better prevent risks, we need better methods to analyze them. Stock markets have become increasingly complex and changeable, and the existing basic theories no longer apply to their changes. According to the theory of random walks, the stock market is very difficult to predict. Nevertheless, stock research remains a focus of attention [2,3,4,5]. Behavioral finance holds that investors are influenced to a large degree by the words and actions of others when they make decisions [1].

In the context of the rapid development of information technology, the Internet has gradually penetrated and exerted a great influence on people’s daily life. Netizens can quickly obtain information and discuss their views on Internet platforms such as Weibo, WeChat and forums. These social media have become the main platforms for expressing people’s emotions [6]. Various opinions that express people’s thoughts are gradually generated. As time goes on, the generation of this information has brought us into the era of big data, which inspires us to study various problems [7,8,9]. Apala et al. [10] predicted box office performance using data from Twitter, YouTube and the Internet Movie Database (IMDb). Golbeck et al. [11] adopted public information of users on Facebook to predict their personalities. At the same time, the Internet has become the main platform for investors to communicate about the stock market. They can obtain many pieces of information about the stock market, macroeconomic indexes, expert comments and analyses, etc. The posts, post volume, attention and other information in social media can all serve as a research basis for the stock market [12]. Users on the Internet express their true thoughts, attitudes, emotions, and other opinions on network events through open and casual communication platforms, thereby forming network public opinion. It can reflect the development trend of public opinion and help managers better understand it [13]. Various social media gather the ideas of numerous people, and much stock information can be mined from stock-related texts to evaluate the development trend of stocks [14,15,16,17,18,19]. In this paper, we obtain text information from East Money with a web crawler. East Money is an authoritative financial website with a large data flow and up-to-date information, and it is also the most visited of all financial websites.
The posts on the forum span many pages. Downloading information manually would be not only inefficient but also a heavy workload. To solve this problem, web crawler technology, which grabs relevant web page information directionally, was developed. Web crawlers, also known as web spiders, are programs or scripts that automatically grab information on the World Wide Web based on certain rules, and effectively access relevant web pages and links to obtain the required information according to the given capture target. In light of system structure and implementation technology, web crawlers are generally divided into general crawlers, focused crawlers, incremental crawlers and deep web crawlers [20,21,22]. However, in practical applications, most crawlers are implemented as a combination of these [23, 24].

Text classification is the most crucial step for analyzing stock movement in this paper. In the era of big data, the traditional methods of text sentiment analysis mainly include manual dictionary construction and machine learning. However, these two methods not only cost a lot of manpower but also have low efficiency and quality. Therefore, deep learning, an important research field within machine learning, is utilized for text analysis. Deep learning improves the accuracy of text classification by constructing a network model that simulates the human brain and nervous system to analyze the texts and automatically optimize model parameters. Deep learning has multi-layer structures, and the nonlinear mappings between these structures allow it to handle complex functions. A large amount of text data can burden the learning process; however, deep learning obtains the important variables of the input data through a layer-by-layer learning algorithm, helping to avoid overfitting. At present, convolutional neural networks (CNN) and recurrent neural networks (RNN) are mainly adopted for text classification [25, 26]. Because classification accuracy depends not only on the model but also on many uncertain factors, in this paper we first compare the models commonly adopted in text classification and use the best-performing model, 1DCNN, to implement sentiment classification in the text experiments. Secondly, public opinion analysis combines the text features and the data characteristics of the stock market to evaluate the stock trend. During this process, we choose a long short-term memory (LSTM) network to complete the public opinion analysis experiment.

Nowadays, deep learning technology has gradually entered various fields, including finance. We adopt 1DCNN to classify text and analyze its sentiment with respect to the stock market, which is significant research. It not only expands the application field of deep learning but also brings good news to the financial market. This paper first analyzes the sentiment expressed in post titles in the Guba forum, and then analyzes the trend of the stock market by combining text information with the historical data of the stock.

To sum up, the main contributions of this paper are listed as follows:

  • In this paper, web crawler technology, playing the role of an information transmission channel, is developed to obtain text data. The most important parts are accessing the web server (including user authorization and file download) and parsing the required HTML files. Here the Python language is adopted to implement them, with the requests and Beautiful Soup libraries.

  • The common manual dictionary construction method is not adopted in this paper to process the text; instead, character embedding is utilized to realize text classification. This method recognizes the emoticons in the text without considering the semantic and grammatical structure of the language.

  • Combining two features, this paper proposes a framework to enhance the assessment of stock market movement. The two features are the emotional feature of text and the stock price feature of stock market trading. Among them, the sentiment tendency of large-scale text data should be labeled manually before the feature extraction.

  • In order to obtain more comprehensive and valuable information to evaluate the stock trend, the characteristic information is extracted to analyze the trend of the stock and realize multi-classification (the stocks’ rise and fall).

The remainder of the paper is structured as follows: Sect. 2 reviews some related works. In Sect. 3, we introduce the details of the proposed method. Section 4 describes and analyzes the experimental results. In Sect. 5, we summarize the paper and outline future work.

2 Related Work

The most important point in analyzing the trend of stock prices via network public opinion is how to realize sentiment analysis on the obtained text. In this section, we introduce the related work from the following two aspects.

2.1 Deep Learning in Text Sentiment Classification

Since the rapid development of deep learning technology in 2012, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in particular have gradually been widely used in the field of Natural Language Processing (NLP). This has made text classification easier and continuously improved its accuracy. CNN is a feedforward neural network that adopts convolution calculations instead of general matrix multiplication. CNN reduces the connections between network layers by sharing weights. To reduce the number of parameters between network layers and avoid the risk of overfitting, CNN also performs convolution operations on text information to extract its local features. Subsampling, also known as the pooling layer, includes max pooling and mean pooling. The convolution and subsampling reduce the complexity and parameters of the model. The most important part of CNN for text classification is the convolution layer. It can solve the problem of loss of sequence information between words caused by traditional classification methods. After passing through the convolution layer, n words can be combined to improve classification accuracy. The first use of CNN for text classification was proposed by Kim in 2014 [27]. In his experiment, a single-layer CNN was adopted to model the text: the preprocessed text word vectors were taken as input, and then the CNN was adopted to realize sentence-level classification. Sun et al. [28] proposed a CNN-based Weibo sentiment analysis method that combined posts and comments with a new convolutional auto-encoder. It can extract contextual emotional information from Weibo conversations. Liao et al. [29] designed a simple CNN model and analyzed sentiment in a Twitter database to predict user satisfaction with products and specific environments, or the damage after a disaster. The method has higher accuracy in sentiment classification than traditional support vector machines (SVM) and Naive Bayes. Dos Santos et al.
[30] proposed a new deep convolutional neural network that adopted information from the character to the sentence level for sentiment analysis of movie reviews and Twitter messages. The CNN model is mainly used for short text, and the RNN model is generally used for long text. RNN is a kind of neural network that focuses on structure and has a memory function. Through self-feedback neurons, it can process sequences of arbitrary length. RNN preprocesses text of different lengths by truncating long texts and padding short ones. RNN takes each word as a time node and the word vector as the input feature of the text. It usually combines sentences forward and backward to construct bidirectional features. Classification models based on RNN are very flexible and have various structures. Zhang et al. [31] proposed a sentiment method based on an RNN. This method utilizes distributed word representation technology to construct vectors for the words in sentences, and then uses an RNN to train fixed-length vectors for sentences of different lengths. In this way, the resulting sentence vector can contain both word semantics and sequence features. Abdi et al. [32] proposed a deep-learning-based method to classify users’ opinions expressed in comments. The method uses the advantages of an RNN composed of Long Short-Term Memory (LSTM) units and sequence processing to overcome the sequence and information loss of traditional methods. Yan et al. [33] designed an encoder-decoder model using LSTM, and solved the problems in time series prediction that multiple input features have different influences on the target sequence and that the data before and after a point in the sequence are strongly time-correlated, by assigning weights to different input features and time points. The experimental results show that the unified feature set learning method obtains significantly better performance than learning from a feature subset.
Based on an experimental comparison of related models, this paper selects 1DCNN to classify text sentiment.

2.2 Analysis of Online Public Opinion

Social networks are full of people with different identities and educational backgrounds. Professionals’ opinions are often followed by others, and they may become leaders in the investment market. Ordinary people, however, lack professional knowledge, which prevents them from obtaining accurate information and leads to blind following. With the advent of machine learning, these problems have been solved to some extent.

Schumaker et al. [34] used financial news as a text extraction source and support vector machines (SVM) as the classification model to predict the stock market. They studied the role of financial news in three different text feature representations: bags of words, noun phrases, and named entities. The results showed that noun phrases performed better for stock prediction than bags of words, and that prediction by SVM is better than that by linear regression. Some methods analyze the stock market by using the emotions conveyed by text [35,36,37]. Bollen et al. [38] proposed a self-organizing fuzzy neural network to study the influence of Twitter sentiment values on stock market price prediction. It was found that certain specific sentiments can improve accuracy in stock price prediction. The sentiment states in this experiment are divided into Calm, Alert, Sure, Vital, Kind, and Happy, among which Calm can be used to predict the Dow Jones Index. Patel et al. [39] proposed a fusion of Support Vector Regression (SVR), Artificial Neural Networks (ANN), and Random Forests to predict the CNX Nifty and S&P index of the Indian stock market.

In recent years, deep learning, a branch of machine learning, has also attracted more and more attention in the financial market, mainly through CNN and RNN. Ding et al. [40] proposed a deep learning method for event-driven stock market prediction, in which news events were extracted from text and represented by dense vectors before being input to the model. Then, a deep convolutional neural network was used to model the long-term and short-term effects on stock price movements. Li et al. [41] proposed a PCC-BLS framework based on the Pearson correlation coefficient (PCC) and the broad learning system (BLS), which was compared with 10 machine learning algorithms and obtained the best performance and the highest model fitting capability. Singh et al. [25] utilized a method combining (2D)2PCA with deep learning to evaluate stocks. It was found that this method improved the evaluation accuracy by 4.8% compared with the Radial Basis Function Neural Network (RBFNN). Vargas et al. [26] focused on CNN and RNN structures to predict the standard intraday direction. Results showed that CNN can better capture the semantic information in text, while RNN can better capture context information and model complex temporal characteristics for stock market forecasting. Because the trend of the stock market is ultimately a time series problem, this paper chooses LSTM, a variant of RNN used for time series investigation, to analyze stock market public opinion, evaluate its tendency and provide people with an auxiliary investment reference.

3 Proposed Methodology

The overall frame structure is shown in Fig. 1. This paper analyzes and studies the stock market through feature fusion, including financial text and stock data features. Firstly, based on 1DCNN, the labeled financial texts are taken as the input of the classification model to realize sentiment classification. Secondly, text feature-level fusion is realized according to the classification results, that is, the transformation from text data to numerical data (the text sentiment value). Thirdly, the text sentiment value and the stock data are combined into time series data as the input of the public opinion analysis model. During this process, this paper adopts different durations and data of different dimensions for comparison. Finally, based on the analysis results, the stock trend is evaluated and some useful references are given.

Fig. 1

Illustration of the basic frame structure. The data mainly include two parts: financial text and stock price data

3.1 The Algorithm Flow and Description

The flowchart is shown in Fig. 2. This paper mainly includes two types of features: text features and stock historical data features. It is necessary to convert the text into vectors as input because text is unstructured data. The text features and the technical features of the stock data are the input to the model of this paper.

Fig. 2

The flow diagram. The whole process includes two parts: financial text classification and stock market public opinion analysis

3.2 Web Crawler

The starting point of this paper is to effectively evaluate changes in the stock market in combination with the characteristics of online public opinion, which involves obtaining a large amount of information, such as post titles, post comments, and publishing times. As a data capture tool, a web crawler can quickly capture the specified information, so this paper uses web crawler technology to obtain text data related to the stock market. The frame of the web crawler in this paper is shown in Fig. 3. Its basic workflow is as follows:

  1.

    Firstly, one or more web link Uniform Resource Locators (URLs) given in advance are taken as the initial page. In addition to text messages, web pages also contain hyperlinks, through which the web crawler system can get specific web pages. The web link given in this paper is determined based on the stock code, and the stock website is found according to the code.

  2.

    Then, according to a certain network analysis algorithm, the links unrelated to the topic are filtered out, and the effective links are stored in the URL queue.

  3.

    Finally, take a URL from the queue and download the corresponding web page. A time span is also set in this paper to crawl the information of a stock within a certain period. We use the selectors //div[@class="articleh"] and //div[@class="articleh normal_post"] to get the number of titles, the number of comments, the content of the title, the author and the time of the posts. When accessing the content, the URL where the title is located is also stored in an Excel table, and an ID is set for each title.

  4.

    In this paper, we need to grab the text data of five stocks. Thus, after finishing the data of the specified web page and the corresponding time span, we analyze the next URL, put it into the queue to be grabbed, and enter the next cycle. This is repeated until the conditions are met.
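The title-extraction step above can be sketched as follows. The paper uses the requests and Beautiful Soup libraries; this illustration relies only on Python's standard library and a simplified, hypothetical Guba-like HTML snippet, so it shows the parsing logic rather than the authors' actual crawler.

```python
# Minimal sketch of extracting post titles from divs with class "articleh"
# or "articleh normal_post", mirroring the selectors described in step 3.
from html.parser import HTMLParser

class GubaTitleParser(HTMLParser):
    """Collects the text inside matching <div> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # >0 while inside a matching div (handles nesting)
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1 if tag == "div" else 0
        elif tag == "div":
            cls = dict(attrs).get("class", "")
            if cls in ("articleh", "articleh normal_post"):
                self.depth = 1
                self.titles.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.titles[-1] += data.strip()

# Hypothetical snippet standing in for a downloaded forum page
sample = ('<div class="articleh">Bullish on this stock</div>'
          '<div class="other">ignored</div>'
          '<div class="articleh normal_post">Selling tomorrow</div>')
parser = GubaTitleParser()
parser.feed(sample)
print(parser.titles)  # ['Bullish on this stock', 'Selling tomorrow']
```

In the real pipeline, the `sample` string would be the response body fetched with requests, and the extracted titles would be written to the Excel table together with their IDs and URLs.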

Fig. 3

Web crawler frame diagram

3.3 Data Collection

In this paper, we choose the period from 2nd Jan 2019 to 29th Mar 2019 for the experiments. Five representative stocks are selected, each with 58 trading days, giving 290 stock trading days in total. The data we need include two parts: the stock-related trading data and the titles from the stock forum. The stock data are obtained from a public data set and the text data are collected by a web crawler.

3.3.1 Text Data

Figure 4 shows the flowchart of obtaining text data in our experiment. Text data can be obtained from various social media. Abroad, researchers mainly crawl Twitter [34]. The extracted content generally includes the time of posts, the number of followers and the content of tweets. In China, however, we mainly crawl the Guba forum or Sina Weibo. The information generally comprises the reading count, comment number, post title, post comments, and release time. In addition, there are news data including ordinary news and financial news, from sources such as Yahoo, the Financial Times and some domestic leading news sites [42,43,44,45]. In this paper, text data are obtained by crawling the Guba forum of East Money. East Money is an authoritative Chinese financial website with a large data flow and up-to-date information, and there are many texts related to stock markets on it. The text data information includes posts, post volume, attention, post time, user name and comments.

Fig. 4

The process of obtaining text data

3.3.2 Stock Data

Stock data are generally obtained from stock exchanges. There are four stock exchanges in China including the Hong Kong Stock Exchange, the Taiwan Stock Exchange, the Shanghai Stock Exchange and the Shenzhen Stock Exchange. In addition, financial data can also be obtained by the Tushare package in Python. In this paper, we obtain stock data from the open dataset of the NetEase financial website for analysis.

The stock data come from five different industries: real estate, high-end equipment manufacturing, the smart industry, new energy and banking. The five stocks are representative of their industries: Vanke A, Aerospace Science and Technology, HengBao Shares, Woer Heat-Shrinkable Material, and the Industrial and Commercial Bank of China (ICBC). Due to the long names of Aerospace Science and Technology and Woer Heat-Shrinkable Material, we abbreviate them as AST and WHSM, respectively, in what follows. The stock data generally include the opening price, closing price, lowest price, highest price, and volume. The stock data come from the stock trading records of the exchange market, and the price and trading volume of each transaction constitute the basis of the stock data. There is no fixed time interval between transactions: there may be only a few low-frequency stock transactions in an hour, while there may be dozens of high-frequency transactions in a second. To record these data, the securities field usually samples them at fixed time intervals. Table 1 gives the description and formula of the basic characteristics of the stock data. We obtain all the characteristic data in the table to form the experimental data set. t represents a given trading day, and t−1 is the nearest trading day before t. The basic features are the technical features mentioned in Sect. 3.1.
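As an illustration of how such day-level features can be derived from raw records, the sketch below uses common conventions for the change, relative change and amplitude of day t versus day t−1; Table 1's exact formulas are not reproduced here, so these definitions are assumptions for illustration only.

```python
# Hedged sketch: compute day-t technical features from two daily records.
def day_features(today, prev):
    """today/prev: dicts with open/close/high/low/volume for days t and t-1."""
    prev_close = prev["close"]
    return {
        "change": today["close"] - prev_close,                     # absolute change vs. day t-1
        "pct_change": (today["close"] - prev_close) / prev_close,  # relative change
        "amplitude": (today["high"] - today["low"]) / prev_close,  # intraday swing
        "volume": today["volume"],
    }

t1 = {"open": 10.0, "close": 10.5, "high": 10.8, "low": 9.9, "volume": 12000}
t0 = {"open": 9.8, "close": 10.0, "high": 10.1, "low": 9.7, "volume": 11000}
print(day_features(t1, t0))
```

Applying such a function over each consecutive pair of trading days yields the feature rows that form the experimental data set.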

Table 1 Basic features and the formulas

3.4 Text Classification Model Based on 1DCNN

The types of text sentiment can be learned from the related literature. Text data are mainly divided into three or five categories. The three categories are bullish, bearish and neutral [46, 47]. The five categories are strongly bullish, bullish, neutral, bearish, and strongly bearish [48]. Although five sentiment types are rare in past research, they clearly distinguish emotional intensity. Hence, in this paper, to better understand post sentiments, the post content is divided into five sentiments. Neutral refers to ambiguity (that is, bullish or bearish cannot be clearly expressed) and some noise posts. For ease of computation, STB represents strongly bullish, B bullish, H neutral, D bearish and STD strongly bearish in this paper.

Figure 5 gives the main model structure for text classification. The text in this paper comes from the post titles of the Guba forum of the authoritative website in China. Compared with English, its semantics and grammar are more complex. In this paper, in order to pursue a network with low computing cost and superior classification performance, 1DCNN with character embedding is adopted to realize text classification. Every language is made up of characters. The characters used in this paper mainly contain 26 letters, 10 digits, various other symbols, etc. After inputting the text, the model first constructs a vocabulary to form a character-level representation, and then uses one-hot encoding to quantify the characters. Text classification based on character embedding not only does not need to consider the meaning of individual words (grammatical and semantic information), but can also recognize the emoticons in the posts.
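The vocabulary construction and one-hot quantization just described can be sketched as follows; the alphabet here is a toy subset built from the sample texts, not the paper's full set of letters, digits and symbols.

```python
# Sketch of character-level vocabulary building and one-hot encoding.
def build_vocab(texts):
    """Map each distinct character to an index."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot(text, vocab, max_len):
    """Each character becomes a one-hot row; texts are truncated/zero-padded to max_len."""
    rows = []
    for ch in text[:max_len]:
        row = [0] * len(vocab)
        if ch in vocab:            # unknown characters stay all-zero
            row[vocab[ch]] = 1
        rows.append(row)
    rows += [[0] * len(vocab)] * (max_len - len(rows))  # zero-pad short texts
    return rows

vocab = build_vocab(["abc", "cb1"])
print(vocab)               # {'1': 0, 'a': 1, 'b': 2, 'c': 3}
print(one_hot("ab", vocab, 3))
```

In the paper's setting, `max_len` would be the sentence length of 100 and the vocabulary the full character set, giving the 100 × 50 input described below.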

Fig. 5

Illustration of the model structure for text classification. The embedding, conv1d, max-pooling, fully connected and softmax layers are contained in the model. \(\left\{ {W_{i} } \right\}\) represents a vector of length \(i\) for a post

The model mainly includes an embedding layer, convolution, pooling, fully connected layer, etc. The sentence length is set to 100, and the word vector dimension is 50. That is, the embedding layer dimension is 100 × 50. The convolution kernel is 3, and the learning rate is set as 0.001. The main component of the model is the temporal convolutional module [49], which computes a 1D convolution in the text classification of this paper. Assume we have an input function \(g\left( x \right) \in \left[ {1,m} \right] \to \Re\) and a kernel function \(f\left( x \right) \in \left[ {1,n} \right] \to \Re\). The convolution function \(Q\left( y \right) \in \left[ {1,\left\lfloor {\left( {m - n} \right)/d} \right\rfloor + 1} \right] \to \Re\) determined by \(g\left( x \right)\) and \(f\left( x \right)\) with stride \(d\) is shown in (1).

$$ Q\left( y \right) = \left( {f * g} \right)\left( y \right) = \sum\limits_{x = 1}^{n} {f\left( x \right) \cdot g\left( {y \cdot d - x + c} \right)} $$
(1)

where \(c = n - d + 1\) denotes an offset constant and \(*\) represents the convolution operation. A set of kernel functions \(f_{ij} \left( x \right)\left[ {\left( {i = 1,2,...,l } \right),\left( {j = 1,2,...,k } \right)} \right]\), which we call weights, is utilized to parameterize the module. \(l\) and \(k\) are the feature sizes of the input and output, respectively. \(g_{i}\) and \(Q_{j}\) are the input and output features. The outputs of the module \(Q_{j} \left( x \right)\) are obtained by a sum over \(i\) of the convolutions between \(g_{i} \left( x \right)\) and \(f_{ij} \left( x \right)\).
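As a concrete check of Eq. (1), the temporal convolution can be transcribed directly in Python. This is an illustrative, unoptimized sketch using the paper's 1-based indices shifted to Python's 0-based lists, not the framework implementation used in the experiments.

```python
# Direct transcription of Eq. (1): Q(y) = sum_{x=1}^{n} f(x) * g(y*d - x + c),
# with offset constant c = n - d + 1 and output length floor((m - n)/d) + 1.
def conv1d(f, g, d=1):
    n, m = len(f), len(g)
    c = n - d + 1
    out_len = (m - n) // d + 1
    return [sum(f[x - 1] * g[y * d - x + c - 1] for x in range(1, n + 1))
            for y in range(1, out_len + 1)]

print(conv1d([1, 2, 3], [1, 0, 0, 1, 0]))  # [3, 1, 2]
```

Note that with stride 1 the offset constant makes this a true convolution (the kernel is flipped relative to a sliding-window correlation).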

To reduce the parameter number and avoid model overfitting, the temporal max-pooling is added behind the convolution layer. The 1D version of the max-pooling module in computer vision is used in the paper [49, 50]. Given an input function \(g\left( x \right) \in \left[ {1,m} \right] \to \Re\), the temporal max-pooling function \(Q\left( y \right) \in \left[ {1,\left\lfloor {\left( {m - n} \right)/d} \right\rfloor + 1} \right] \to \Re\) of \(g\left( x \right)\) is defined as:

$$ Q\left( y \right) = \mathop {{\text{max}}}\limits_{x = 1}^{n} g\left( {y \cdot d - x + c} \right) $$
(2)

where \(c = n - d + 1\) is an offset constant.
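Eq. (2) can be sketched in the same style; the pool size n and stride d below are hypothetical example values chosen only to make the arithmetic easy to follow.

```python
# Direct transcription of Eq. (2): Q(y) = max_{x=1..n} g(y*d - x + c),
# with the same offset constant c = n - d + 1 as in Eq. (1).
def max_pool1d(g, n, d):
    m = len(g)
    c = n - d + 1
    out_len = (m - n) // d + 1
    return [max(g[y * d - x + c - 1] for x in range(1, n + 1))
            for y in range(1, out_len + 1)]

print(max_pool1d([1, 5, 2, 8, 3, 1], n=2, d=2))  # [5, 8, 3]
```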

The cross entropy is adopted to calculate the loss in the classification model, and its formula is as follows.

$$ L = \frac{1}{N}\sum\limits_{i} {L_{i} } = \frac{1}{N}\sum\limits_{i} { - \sum\limits_{c = 1}^{M} {y_{ic} {\text{log}}\left( {p_{ic} } \right)} } $$
(3)

where \(M\) is the number of classes. \(y_{ic}\) stands for the label: if class \(c\) is the true class of sample \(i\), \(y_{ic} = 1\); otherwise, it is 0. \(p_{ic}\) represents the predicted probability that sample \(i\) belongs to text category \(c\).

Adaptive moment estimation (Adam) [51] is adopted as the optimizer in this paper. The advantage of Adam is that, after bias correction, each iteration’s learning rate lies within a certain range, which makes the parameters more stable. In our experiment, the minibatch size is 64. We also insert a dropout module between the 2 fully-connected layers for regularization, with a dropout probability of 0.5.

Finally, softmax is used to calculate the feature of text classification. The calculation formula is:

$$ X_{i} = {\text{softmax}}\left( z \right)_{i} = e^{{z_{i} }} /\sum\limits_{k = 0}^{4} {e^{{z_{k} }} } $$
(4)

where \(z\) is the output of the fully connected layer. \(i = 0,1,2,3,4\) stands for five categories (STB, B, H, D, STD) in this paper and \(X\) represents the probability of the text classification output.
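The softmax of Eq. (4) and the cross-entropy loss of Eq. (3) can be sketched together in plain Python; this is an illustrative implementation of the two formulas, not the framework code used in the experiments.

```python
import math

def softmax(z):
    """Eq. (4): probabilities over the five categories STB, B, H, D, STD."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, probs):
    """Eq. (3): mean over N samples of -sum_c y_ic * log(p_ic)."""
    n = len(labels)
    return sum(-sum(y * math.log(p) for y, p in zip(yi, pi))
               for yi, pi in zip(labels, probs)) / n

p = softmax([0.0, 0.0, 0.0, 0.0, 0.0])   # equal logits -> uniform: 0.2 each
y = [[1, 0, 0, 0, 0]]                    # one sample labelled STB
print(cross_entropy(y, [p]))             # -log(0.2) = log(5) ≈ 1.609
```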

Since the data selected in this paper cover one quarter, two months of text data are taken as the training set and one month as the test set. During training, the parameters of 1DCNN are updated by backpropagation to reduce the error between the predicted value and the real value.

3.5 Public Opinion Analysis Model

The main structure of the public opinion analysis model is shown in Fig. 6. RNN is frequently used in time series analysis and other tasks [52, 53]. RNN is called recurrent because the current sequence step depends on the previous one, so the steps are related to each other. In our dataset, we need a continuous time series as the input feature. However, RNN cannot handle long-term dependencies due to gradient problems. LSTM and Gated Recurrent Units (GRU) were proposed to solve this issue by employing a gate control mechanism. As variants of RNN, they can largely make up for the loss caused by vanishing gradients. In our model, we choose LSTM because of its better performance in processing time series.

Fig. 6

The main structure of the public opinion analysis model. The public opinion data are time series. \(S_{t - 1}\) stands for the previous time series information, and \(h_{t - 1}\) represents the output of the previous time step. \(S_{t}\) is the current time series information we need to process, and \(h_{t}\) is the current result of the time series. a, b, c, and d are the offsets in different processes. The blue dotted box is the cell state and the red dotted box indicates the forget gate. The purple box represents the input gate, which combines old and new memories. The sky blue box stands for the output gate, which is utilized to output the results of public opinion analysis. \(y_{t}\) is the classification result through the softmax layer. 0, 1, and 2 are the public opinion categories

At time t, the calculated process of the LSTM is as follows:

$$ f_{t} = \sigma \left( {W_{f} \cdot \left[ {h_{t - 1} ,x_{t} } \right]{ + }b_{f} } \right) $$
(5)
$$ i_{t} = \sigma \left( {W_{i} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{i} } \right) $$
(6)
$$ o_{t} = \sigma \left( {W_{o} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{o} } \right) $$
(7)
$$ \tilde{S}_{t} = {\text{tanh}}\left( {W_{S} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{S} } \right) $$
(8)
$$ S_{t} = f_{t} \cdot S_{t - 1} + i_{t} \cdot\tilde{S}_{t}$$
(9)
$$ h_{t} = o_{t} \cdot {\text{tanh}}\left( {S_{t} } \right) $$
(10)

where \(f_{t}\) represents the forget gate; it determines how much of the previous cell state \(S_{t - 1}\) is retained. \(i_{t}\) stands for the input gate and \(o_{t}\) is the output gate (as shown in Fig. 6). \(\tilde{S}_{t}\) indicates the new memory cell and \(S_{t}\) is the current public opinion analysis information. \(W\) and \(b\) represent the weights and offsets in the different operations, respectively.
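The gate equations (5)-(10) can be traced with a toy scalar (single-unit) cell. The weight and bias values below are placeholders chosen only to make the arithmetic easy to follow, not trained parameters, and a real LSTM layer operates on vectors and matrices rather than scalars.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, S_prev, W, b):
    """One scalar LSTM step following Eqs. (5)-(10); W holds per-gate
    (weight_on_h, weight_on_x) pairs and b the per-gate offsets."""
    f = sigmoid(W["f"][0] * h_prev + W["f"][1] * x + b["f"])        # forget gate, Eq. (5)
    i = sigmoid(W["i"][0] * h_prev + W["i"][1] * x + b["i"])        # input gate, Eq. (6)
    o = sigmoid(W["o"][0] * h_prev + W["o"][1] * x + b["o"])        # output gate, Eq. (7)
    S_new = math.tanh(W["S"][0] * h_prev + W["S"][1] * x + b["S"])  # new memory, Eq. (8)
    S = f * S_prev + i * S_new                                      # cell state, Eq. (9)
    h = o * math.tanh(S)                                            # output, Eq. (10)
    return h, S

# With all weights and biases zero, every gate is sigmoid(0) = 0.5 and the
# new memory is tanh(0) = 0, so S_t = 0.5 * S_{t-1}.
W = {g: (0.0, 0.0) for g in "fioS"}
b = {g: 0.0 for g in "fioS"}
h, S = lstm_step(x=1.0, h_prev=0.0, S_prev=2.0, W=W, b=b)
print(S)  # 1.0
print(h)  # 0.5 * tanh(1.0) ≈ 0.381
```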

Among the model of public opinion analysis, softmax is used to calculate the feature of public opinion. The calculation formula is:

$$ Y_{j} = {\text{softmax}}\left( z \right)_{j} = e^{{z_{j} }} /\sum\limits_{k = 0}^{2} {e^{{z_{k} }} } $$
(11)

where \(z\) is the output of the fully connected layer. \(j = 0,1,2\) stands for down, flat or up, and \(Y\) represents the probability of the public opinion analysis output.

The loss function used in the public opinion analysis model is shown in (3). The optimizer is Adam, consistent with the text classification model. In the experiment, the ratio of the training set to the validation set is 0.85. The initial number of hidden nodes is 128. The learning rate ranges from 0.001 to 0.000001, and the batch size is between 4 and 100.

4 Experiment and Results Analysis

In this section, we discuss and compare several relevant methods. The experimental data (forum posts and historical stock data) cover 2nd Jan 2019 to 29th Mar 2019. The research content of this paper is divided into stock text classification and public opinion analysis, so we adopt CNN-based models including 1DCNN to classify financial text, and then adopt RNN-based models including LSTM to complete the public opinion analysis experiment. We list the parameters of the experiment in Table 2. The CNN and RNN adopt the same loss function and optimizer as in the literature [54, 55, 59]. In addition, the number of fully connected output channels in the first layer of the 1DCNN is the same as in the literature [54], and the number of hidden layers of the LSTM is the same as in the literature [59]. Parameters such as the learning rate and the number of epochs need to be tuned to select appropriate values. Therefore, we adopt a 60% subset of the dataset as a training set to learn the model parameters, a 20% subset as a test set to evaluate the model performance, and a 20% subset as a validation set to find an appropriate parameter set. All the experiments were performed on a regular workstation (CPU: Intel(R) Core (TM) i7-8700 @ 3.20 GHz; GPU: GTX1070Ti; RAM: 32.0 GB).

Table 2 Setting of experimental parameters of the proposed method

4.1 Data Preprocessing

4.1.1 Text Data

The text needs to be preprocessed before classification with the networks, mainly including sentiment marking and word-vector conversion. Table 3 shows some examples of sentiment-marked text from Guba; to make the sentiment easier to distinguish visually, the core words are highlighted in the table. In our experiments, the text is first manually labeled, and then the character embedding method is used to check the classification accuracy. The total number of processed samples is 9066 (STB: 482, B: 1830, H: 4348, D: 2012, STD: 394). The period of each stock is from January 2, 2019 to March 29, 2019 (290 trading days in total).

Table 3 Text examples from the Guba forum

After sentiment classification, we need to calculate the daily proportion of each sentiment. The sentiment feature is defined as follows:

$$ P_{j} = \frac{{P_{t} }}{{P_{{{\text{STB}}}} + P_{{\text{B}}} + P_{{\text{H}}} + P_{{\text{D}}} + P_{{{\text{STD}}}} }} $$
(12)

where \(t\) denotes the sentiment type and \(P_{j}\) represents the proportion of sentiment \(t\) on day \(j\). \(P_{t}\), \(P_{{{\text{STB}}}}\), \(P_{{\text{B}}}\), \(P_{{\text{H}}}\), \(P_{{\text{D}}}\), and \(P_{{{\text{STD}}}}\) indicate the number of posts with each sentiment on day \(j\), respectively.

The overall day sentiment is as follows:

$$ P = {\text{Max}}\left( {P_{j} } \right) \times w_{i} $$
(13)

in which \(w_{i}\) denotes the weights of STB, B, H, D, and STD, respectively: +2, +1, 0, −1, and −2. The weight values are assigned following reference [33]. The neutral sentiment is generally not included in the calculation of the sentiment value. After the sentiment value is calculated, it is normalized by (14).
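Equations (12) and (13) can be sketched with hypothetical daily counts (the numbers below are illustrative, not from the dataset):

```python
# Hypothetical counts of each sentiment on one day
counts = {"STB": 5, "B": 20, "H": 50, "D": 15, "STD": 10}
weights = {"STB": 2, "B": 1, "H": 0, "D": -1, "STD": -2}

# Eq. (12): proportion of each sentiment among all posts of the day
total = sum(counts.values())
proportions = {t: n / total for t, n in counts.items()}

# Eq. (13): dominant sentiment scaled by its weight; the neutral class H
# is excluded from the day-level value, as noted above
dominant = max((t for t in proportions if t != "H"), key=proportions.get)
P = proportions[dominant] * weights[dominant]
print(dominant, P)
```

With these counts, B dominates the non-neutral sentiments, giving a mildly positive day sentiment of 0.2; the value would then be normalized by (14).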

4.1.2 Stock Data

  1) Normalization

Because stock data change greatly and their trends vary across industries, the stocks need to be standardized after acquisition so that different stocks can be compared. Commonly used normalization methods include minimum–maximum normalization, decimal scaling normalization, and Z-score normalization. In this paper, we adopt the most widely used minimum–maximum normalization, whose transformation function is as follows:

$$ x^{*} = \frac{{x - x_{{{\text{min}}}} }}{{x_{{{\text{max}}}} - x_{{{\text{min}}}} }} $$
(14)

where \(x_{{{\text{max}}}}\) and \(x_{{{\text{min}}}}\) are the maximum and minimum values of the sample data, respectively.
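A minimal sketch of Eq. (14); the price list is a made-up example:

```python
def min_max_normalize(values):
    """Eq. (14): scale each value into [0, 1] using the sample
    minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [10.0, 12.5, 11.0, 15.0]   # hypothetical closing prices
print(min_max_normalize(prices))     # [0.0, 0.5, 0.2, 1.0]
```

The minimum maps to 0 and the maximum to 1, so series from stocks with very different price levels become directly comparable.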

  2) Stock data classification

In stock trend evaluation, estimating whether a stock will rise or fall sharply is of more practical significance than merely evaluating whether it rises or falls. Therefore, we set up three categories of stock data in this paper; the specific classification standard is shown in (15).

$$ {\text{class}} = \begin{cases} 0, & r < - 1\% \\ 1, & - 1\% \le r < 1\% \\ 2, & r \ge 1\% \end{cases} $$
(15)

where 0 stands for a sharp drop, 2 for a sharp rise, and 1 for a gentle trend (r is the differential sequence mentioned in Table 1). The rise and fall ranges are divided according to the stock data interval. As seen from Fig. 7, the price changes of the five stocks are generally distributed between −5% and 5%, and the descriptions of rises and falls in the text data also generally fall between −5% and 5%. Hence, to balance the samples, this paper divides the stock data into three categories with −1% and 1% as the boundaries. Figure 8 shows the sample sizes after classification; they are basically balanced, with only category 0 slightly underrepresented.
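The labeling rule of Eq. (15) can be sketched directly; the sample returns below are illustrative:

```python
def label_return(r):
    """Eq. (15): 0 = sharp drop, 1 = gentle trend, 2 = sharp rise,
    using the paper's +/-1% boundaries. r is a fractional return,
    so 1% is written as 0.01."""
    if r < -0.01:
        return 0
    if r < 0.01:
        return 1
    return 2

print([label_return(r) for r in (-0.03, -0.005, 0.0, 0.01, 0.04)])
# [0, 1, 1, 2, 2]
```

Note that the lower boundary is exclusive and the upper boundary inclusive, matching the inequality directions in (15).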

Fig. 7
figure 7

Differential sequence distribution. a–e are the differential sequence points of the five stocks, respectively. The vertical axes show that the variation range is mainly between −5% and 5%

Fig. 8
figure 8

Sample proportion of stock data with three classifications

4.1.3 Public Opinion Data

Public opinion data include both text and stock data. Through different processing methods, the data are organized into five dimensionalities (3, 4, 5, 6, and 7) and five time spans (3 days, 5 days, 7 days, 10 days, and 15 days). These serve as the input to the public opinion model.
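Grouping daily feature vectors into the fixed-length time spans above can be sketched as a simple sliding window; the day placeholders are illustrative:

```python
def sliding_windows(features, span):
    """Group consecutive daily feature vectors into fixed-length
    windows (3, 5, 7, 10, or 15 days) to form model inputs."""
    return [features[i:i + span] for i in range(len(features) - span + 1)]

days = list(range(10))                 # stand-in for 10 daily feature vectors
windows = sliding_windows(days, 3)
print(len(windows), windows[0])        # 8 windows; the first covers days 0-2
```

Longer spans give the model more history per sample but yield fewer samples overall, which is one reason the best span differs across data dimensions (Sect. 4.3).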

The 3D data are mainly constructed by weighting the text and transaction data; the formula is:

$$ Data_{{\text{3 - dimension}}} = \left( {text/2,\;\left( {T_{1} + T_{2} } \right)/2,\;r} \right) $$
(16)

where \(T_{1} = \left| {O_{t} - C_{t} } \right|\) and \(T_{2} = \left| {H_{t} - L_{t} } \right|\). \(O_{t}\) and \(C_{t}\) are the opening and closing prices at time \(t\) mentioned in Table 1, and \(H_{t}\) and \(L_{t}\) are the highest and lowest prices at time \(t\). \(text\) is the sentiment value of the text, and \(r\) is the differential sequence at time \(t\) mentioned in Table 1.
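A minimal sketch of the 3D feature of Eq. (16); the sentiment value and prices below are hypothetical:

```python
def three_dim_feature(text, o, c, h, l, r):
    """Eq. (16): (text/2, (T1 + T2)/2, r),
    where T1 = |O - C| and T2 = |H - L|."""
    t1 = abs(o - c)   # intraday open-close range
    t2 = abs(h - l)   # intraday high-low range
    return (text / 2, (t1 + t2) / 2, r)

# Hypothetical day: sentiment 0.6, O=10.0, C=10.4, H=10.6, L=9.9, r=0.4
print(three_dim_feature(0.6, 10.0, 10.4, 10.6, 9.9, 0.4))
```

The halving acts as the weight assignment mentioned above, balancing the text signal against the aggregated price ranges.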

The 4D data add the stock trading volume to the three-dimensional data. The calculation formula is defined as:

$$ Data_{{\text{4 - dimension}}} = \left( {text/2,\;\left( {T_{1} + T_{2} } \right)/4,\;V/4,\;r} \right) $$
(17)

where \(V\) stands for the stock trading volume at time \(t\) mentioned in Table 1.
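Eq. (17) can be sketched analogously to the 3D case; all input values below are hypothetical:

```python
def four_dim_feature(text, o, c, h, l, v, r):
    """Eq. (17): (text/2, (T1 + T2)/4, V/4, r) - the 3D feature plus
    trading volume, with the price and volume terms reweighted by 1/4."""
    t1 = abs(o - c)
    t2 = abs(h - l)
    return (text / 2, (t1 + t2) / 4, v / 4, r)

# Hypothetical day: sentiment 0.6, O=10.0, C=10.4, H=10.6, L=9.9,
# volume 120000 shares, r=0.4
print(four_dim_feature(0.6, 10.0, 10.4, 10.6, 9.9, 120000, 0.4))
```

In practice the volume term would be normalized by Eq. (14) before weighting, so it stays on a scale comparable to the other components.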

The five-dimensional data consist of the stock opening price, closing price, highest price, lowest price, and differential sequence. The 6D data add the text data, and the 7D data additionally add the stock trading volume on top of the six dimensions.

4.2 Text Classification

4.2.1 Results and Analysis of Confusion Matrix

In order to observe the number of misjudged categories in text classification, the confusion matrix is used as a standard format for accuracy evaluation. Figure 9 shows the confusion matrix of the text classification, which reflects the accuracy of the classification from different aspects.

Fig. 9
figure 9

Confusion matrix of the proposed method. As the number of correct categories increases, the gradient color in the image darkens. Diagonals are the number of correct classifications for each category

The correct decision rate of each class is defined by the following formula:

$$ T_{{{\text{correct}}}} = \frac{{n_{{{\text{correct}}}} }}{{n_{{{\text{all}}}} }} \times 100\% $$
(18)

where \(n_{{{\text{correct}}}}\) represents the number of correctly classified samples of a class, and \(n_{{{\text{all}}}}\) stands for the total number of samples of that class.
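Applying Eq. (18) per class to a confusion matrix can be sketched as follows; the 3×3 matrix is a made-up example, not the paper's 5-class results:

```python
def per_class_correct_rate(confusion):
    """Eq. (18) per class: diagonal count over the row total,
    as a percentage. Rows are true classes, columns predictions."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(100.0 * row[i] / total if total else 0.0)
    return rates

# Hypothetical 3-class confusion matrix
cm = [[40,  5,  5],
      [10, 80, 10],
      [ 2,  8, 40]]
print(per_class_correct_rate(cm))  # [80.0, 80.0, 80.0]
```

Reading the rates class by class makes imbalance effects visible: a class with few samples can have a low rate even when overall accuracy looks acceptable.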

Table 4 shows the proportion of correctly classified samples in text classification according to (18). The higher the proportion of a text category, the less likely it is to be misclassified into other categories. From Table 4, the proportion of H is the highest, with a Tcorrect of 83.97%. Likewise, Fig. 9 shows that the number of correct H samples is 812, the largest share of all correctly classified samples. However, the average classification proportion of the other four categories is relatively low, mainly because the text data themselves are unbalanced: the neutral samples account for the largest share, while the other categories have fewer samples.

Table 4 The proportion of correct samples

4.2.2 Comparison of Experimental Results of Different Methods

Table 5 shows the results of text classification for different methods. Since the text categories in this paper are manually labeled based on relevant financial information, commonly used text classification models are applied to check the accuracy, so as to ensure the reliability of the labels and of the follow-up experiments.

Table 5 Text classification accuracy of different models

As shown in Table 5, this paper conducts comparative experiments with five methods: Literature [56], Literature [57], Literature [58], Literature [59], and 1DCNN + Word2vec. Among them, the methods of [56] and [59] use the original recurrent neural network for classification, while [57] and [58] perform text classification with variants of the recurrent neural network plus an attention mechanism added in the feature extraction part (which strengthens feature extraction and thus improves the model effect); these variants remedy the deficiencies of the original recurrent neural network to some extent. 1DCNN + Word2vec is a word-vector processing model that adds Word2vec to a one-dimensional convolution. To verify the effectiveness of our method through these comparisons, the input data of each network model underwent the same preprocessing. Compared with the literature [56], [57], [58], [59], and 1DCNN + Word2vec, the classification model in this paper shows obvious advantages on all four indexes: the average accuracy is improved by 4%, the precision by 4%, the recall by 5%, and the F1 by 4%. Among classification models on the same dataset, our proposed method thus delivers the better performance and classification accuracy.

4.3 Experiment Results of Public Opinion Analysis

  1) Comparison of experimental results of different data characteristics

Table 6 shows the experimental parameter settings and experimental results of public opinion analysis data with different dimensions and different time spans. There are five groups of data with different dimensions (three-dimensional, four-dimensional, five-dimensional, six-dimensional, and seven-dimensional), and each group of data contains five characteristic data with different continuous times (3 days, 5 days, 7 days, 10 days, and 15 days). The experimental parameters include learning rate, number of hidden layers, number of hidden nodes and activation function, and the experimental accuracy of each case is tested.

Table 6 Parameter setting and accuracy of different dimensions of public opinion analysis experiment

It can be seen from Table 6 that, compared with the five-dimensional data without text information, the other experiments that add text information perform relatively well. Therefore, in theory, adding text information (public opinion information) can improve the evaluation level and provide a basis for evaluating the stock market trend; that is, text information on the Internet can serve as a reference index for subsequent decision-making and as an additional safeguard for investment. Secondly, different dimensions favor different time spans: the best results are obtained at a 3-day span for three-dimensional data (average accuracy 52.76%), 15 days for four-dimensional data (55.15%), 3 days for five-dimensional data (48.82%), 15 days for six-dimensional data (54.55%), and 10 days for seven-dimensional data (57.33%).

  2) Experimental analysis of stock samples with different industry representatives

Table 7 shows the representative stock-related information of the intelligent industry, including text samples and stock-related technical indicators (Ot, Ct, Ht, Lt, and r represent the opening price, closing price, highest price, lowest price, and differential sequence, respectively). For reasons of space, the stock data of the other four industries are shown in Appendixes A–D. T and P represent the real and predicted results of the public opinion analysis in the table, respectively. The text samples are randomly selected 5-day data, with five text samples per day. The predicted value is obtained from both the text and the stock price data. Category 0 is wrongly predicted as category 2 on the second day of the table because of (1) interference from strong bullish and bullish samples in the text data and (2) unbalanced samples: the data selected in this paper are quarterly data, and the number of samples in category 0 is relatively small. On the third day, category 1 was predicted as category 2 mainly because, compared with category 2, the number of samples in category 1 was relatively small, which biased the sentiment tendency. In addition, the common causes of prediction errors are (1) the time interval, (2) the data dimension, and (3) the characteristics of the text and stock price data and their weight distribution.

Table 7 The representative stock-related samples information of the intelligence industry

The above experimental results show that we cannot rely on a single text feature or stock feature alone to evaluate the experiment; we must analyze the results by combining different characteristics and data representations (time span and data dimension). Consequently, given sufficient sample data, both the feature information and the data representation must lie within appropriate limits for the experimental results to be good.

4.4 Analysis of Stock Market Public Opinion

In our experiments, it is found that the results with text information are better than those with stock market trading information alone. Public opinion analysis of the stock market refers to using text information on the Internet to explore the trend of the stock market, that is, to determine whether the text can be used as an indicator for stock market evaluation. By observing the users of East money, we find that in the stock market, whether online or offline, forums like “Guba” have become a platform for traders to obtain information, forming a large-scale social network. In this network, users interact with each other by adding friends and forwarding comments, and text information is the most important carrier of transmitted information: comments resonate with readers, making certain information more likely to spread. Hence, text information reflects the impact of user behavior in social networks to a certain extent, thereby further mapping stock market price changes. Theoretically speaking, textual information can therefore be adopted as an opinion indicator for stock market evaluation. Across the board, the sentiment changes corresponding to the text affect the trend of the stock market to some extent and can also supply investors with constructive recommendations (people can reasonably combine changes in network sentiment with historical stock market data to make decisions). For example, when the sentiment of text information on a certain day is optimistic about the future market, investors can buy an appropriate share of the corresponding stock under the premise of comprehensive consideration; conversely, if the sentiment is negative, investors can sell.

5 Conclusions and Future Work

The change of the stock market plays an important role in the trend of the national economy. With the popularity of artificial intelligence, the study of the stock market has become a hot topic, which makes the work in this paper significant. The rise of the Internet has brought more and more attention to the stock market, giving investors the space to speak freely. Accordingly, this paper proposed a public opinion analysis framework for the stock market and compared the influence of different dimensions and different time spans on the experimental results. The experiment mainly includes two parts.

Firstly, for the text data, we obtained the required public opinion data (the text data corresponding to five leading stocks in different industries) through the crawler rules set in this paper, then cleaned the useless information in the text and manually labeled the text sentiments, and finally applied the designed 1DCNN to classify text sentiment. In the data crawling, we adopted Beautiful Soup and regular expressions in Python to extract the data we actually needed. For text sentiment classification, we compared the experimental results of different models, as well as the results of different processing methods within the same model, and finally proposed a text sentiment classification model. The accuracy of our model is 74.38%, which is better than that of the other models and proves the effectiveness of the proposed method.

Secondly, this paper put forward the public opinion analysis framework for the stock market. The above-mentioned text data are added to the stock market trading data: the sentiment value mapped from the text data and the trading data of the stock market are combined as the input features of the analysis model. In the public opinion analysis experiment, we compared the effects of input characteristics of different dimensions (composed of Chinese texts, stock prices, and stock trading volume with different weights; the stock prices include the opening, closing, highest, and lowest prices) and of different time spans.

The methods proposed in this paper also have limitations. For the time series characteristics of the stock price, the threshold we set when finding trend characteristic points depends on the data set, and setting it manually is complicated; if the threshold could be updated automatically, the process would be far more efficient. As the text data used in this paper are forum post titles, some titles may be incomplete or short, or actually irrelevant to our research context. Since we calculate the stock sentiment intensity from the daily text, incomplete or short texts would affect the accuracy of the sentiment classification algorithm and the evaluation of stock market trends. Therefore, there is still room for improvement in the future. Adding targeted information sources (such as the comments corresponding to titles, multiple social media data, or stocks of the same type) and adopting other feature fusion mechanisms and weight distribution methods would further improve the experimental results, enhance the authority and reliability of the proposed methods, and prepare for further research.