
1 Introduction

There are a number of works aimed at predicting the movement of stock prices based on news [1,2,3,4,5]. However, these works use news headlines as input data. The full text of a news item contains complete information about the event, whereas the headline may not reflect its whole essence. Moreover, these works do not cover the preliminary processing of texts as input data in detail; they focus only on the analysis of a neural network.

Our idea is to predict the direction of price movement based on the content of the entire news text. Taking the news texts themselves as input leads us to the task of analysing them more deeply and determining the best way to prepare the data before feeding it into machine learning models.

A news item is a set of n sentences consisting of m words, where n and m differ from one news item to another. First, we need to understand what data format must be passed to the machine learning algorithms for them to work correctly. Every machine learning algorithm requires a constant number of input variables in the processed data. Therefore, any preprocessing method must map news texts of variable length to a fixed-length vector.

One of the key questions is how to obtain a fixed number of features from news items of different lengths. This issue is discussed in this article. We examine several data preparation algorithms for both classifiers (Fig. 1) and neural networks (Fig. 2) and compare how each of them performs.

Fig. 1. Work with classifier diagram.

Fig. 2. Work with neural network diagram.

2 Word Vector Representation Model

In this section, we first introduce methods for the vector representation of words. After that, we consider approaches to obtaining a news vector from word vectors, which corresponds to some average meaning of the whole news item.

2.1 Bag of Words Model and Word Embedding

There are two main models of vector representation of words - Bag of Words [6] and Word Embedding [7]:

The Bag of Words model represents a document as an unordered collection of words with no knowledge of the relationships between them. The algorithm creates a matrix in which each row corresponds to a separate document or text and each column corresponds to a specific word. As Table 1 shows, the cell at the intersection of a row and a column contains the number of occurrences of the word in the corresponding document. The main disadvantages of this approach are that it captures no semantic meaning of words or whole documents and that it ignores word order, which plays a big role. Therefore, this approach is not suitable for the task set in this work.

Table 1. Word distribution in texts.
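As an illustration, a document-term matrix like the one in Table 1 can be built with scikit-learn's CountVectorizer; the following is a minimal sketch on a small hypothetical corpus (the corpus and variable names are ours, not the dataset of this work).

```python
# Minimal Bag-of-Words sketch: rows = documents, columns = vocabulary words,
# cells = occurrence counts (cf. Table 1). The corpus is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Shanghai stocks opened lower on Wednesday",
    "Other Asian markets also declined on Wednesday",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)       # sparse document-term matrix

print(vectorizer.get_feature_names_out())       # vocabulary (column order)
print(counts.toarray())                         # word counts per document
```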

The Word Embedding model brings together a variety of natural language processing approaches. In this model, a fixed-length vector in an n-dimensional vector space is constructed for each word, and it is built in such a way as to maximize the semantic connection between words. One way to express the semantic connection of words in a vector space is cosine similarity, calculated according to formula (1):

$$ \cos\left( \theta \right) = \frac{A \cdot B}{\left\| A \right\|\left\| B \right\|} = \frac{\sum\nolimits_{i = 1}^{n} A_{i} B_{i}}{\sqrt{\sum\nolimits_{i = 1}^{n} A_{i}^{2}}\,\sqrt{\sum\nolimits_{i = 1}^{n} B_{i}^{2}}} $$
(1)

where \(A_{i}\) and \(B_{i}\) are the \(i\)-th elements of vectors \(A\) and \(B\).
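A direct implementation of formula (1) is shown below as a short, self-contained sketch; the example vectors are hypothetical.

```python
# Cosine similarity between two word vectors, following formula (1).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional word vectors for illustration only.
v_stock = np.array([0.20, 0.80, 0.10])
v_share = np.array([0.25, 0.75, 0.05])
print(cosine_similarity(v_stock, v_share))      # close to 1 for similar words
```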

Within the Word Embedding model, the following algorithms have been developed:

  • Word2Vec [8]

  • GloVe [9].

In the field of NLP, these models are the most modern and widely used today. Therefore, to create a vector representation of words that contains their semantic meaning, we will use the Word2Vec and GloVe models.
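For reference, pre-trained dictionaries of this kind (see Sect. 3.2) can be loaded, for example, through gensim's downloader; the model names below are those shipped by gensim and are our assumption, not necessarily the exact files used in this work.

```python
# Sketch: loading pre-trained 300-dimensional word vectors via gensim.
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")      # Word2Vec, Google News, dim = 300
glove = api.load("glove-wiki-gigaword-300")     # a 300-dimensional GloVe model

print(w2v["stock"].shape)                       # (300,)
print(w2v.similarity("stock", "share"))         # cosine similarity of two words
```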

2.2 Preparation of Texts

At this step, we have a vector representation of each word in the news corpus. A word is a vector in an n-dimensional vector space, that is, a set of n features, where n is fixed. But the number of words differs from one news item to another. Therefore, each news item must be reduced to a fixed number of features.

Simple Averaging

The first way is simple: take all word vectors belonging to the news item and calculate the average vector, following formula (2):

$$ \overrightarrow{\omega_{average}} = \frac{\sum\nolimits_{i = 1}^{m} \overrightarrow{\omega_{i}}}{m} $$
(2)

where \(\overrightarrow{\omega_{i}}\) is the vector of the \(i\)-th word and \(m\) is the number of words in the news item.

Thus, we get a vector corresponding to the whole news item. And if the vector of each word reflects its semantic meaning, then the vector of the whole news item reflects some average meaning of the news. As a result, each news item is matched with a vector of dimension n, where n is fixed.

For example, the news “Shanghai stocks opened lower, and the yuan was weaker against the dollar on Wednesday. Other Asian markets also declined. Hong Kong’s benchmark declined 2% as the city faced heightened tensions” is converted to a single vector as follows:

$$ \left. \begin{array}{l} \text{Shanghai} = \overrightarrow{\left( 0.345,\ 0.101,\ \ldots,\ 0.640 \right)} \\ \text{stocks} = \overrightarrow{\left( 0.783,\ 0.089,\ \ldots,\ 0.554 \right)} \\ \cdots \\ \text{heightened} = \overrightarrow{\left( 0.421,\ 0.484,\ \ldots,\ 0.054 \right)} \\ \text{tensions} = \overrightarrow{\left( 0.383,\ 0.211,\ \ldots,\ 0.954 \right)} \end{array} \right\} \to \overrightarrow{\omega_{average}} = \overrightarrow{\left( 0.497,\ 0.301,\ \ldots,\ 0.628 \right)} $$
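A minimal sketch of formula (2), assuming `model` maps a word to its vector (for example, a gensim KeyedVectors object such as the ones loaded above), could look as follows.

```python
# Simple averaging: map a news text of arbitrary length to one fixed-length
# vector by averaging the vectors of its words (formula (2)).
import numpy as np

def news_vector_average(text: str, model, dim: int = 300) -> np.ndarray:
    words = [w for w in text.lower().split() if w in model]
    if not words:                               # no known words in the news item
        return np.zeros(dim)
    return np.mean([model[w] for w in words], axis=0)
```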

Term Frequency - Inverse Document Frequency (tf-idf)

Words in a news item differ in meaning and importance. It is therefore worth not simply computing the average but constructing a linear combination in which each vector is multiplied by a coefficient corresponding to the importance of the word. This idea is captured by the tf-idf (term frequency - inverse document frequency) measure [10]: it takes into account how frequently a word occurs in the text, so a weighted average is computed.

tf is the ratio of the number of occurrences of a given word to the total number of words in a single text document:

$$tf\left(t,d\right)=\frac{{n}_{t}}{{\sum }_{k}{n}_{k}},$$
(3)

where \({n}_{t}\) is the number of times the word t occurs in the document, and the denominator is the total number of words in the document.

idf is the inverse of the frequency with which a word occurs across all documents. Accounting for idf reduces the weight of commonly used words.

$$idf(t, D)=log\frac{|D|}{|\left\{{d}_{i}\in D|t\in {d}_{i}\right\}|},$$
(4)

where \(|D|\) is the number of documents and \(|\left\{{d}_{i}\in D|t\in {d}_{i}\right\}|\) is the number of documents that contain \(t\) (i.e. in which \({n}_{t}\ne 0\)).

Hence, the tf-idf measure is the product of two factors:

$$ tf{ - }idf\left( {t,d,D} \right) = tf\left( {t,d} \right) \times idf\left( {t,D} \right) $$
(5)

Words that occur frequently within a particular document but rarely in other documents receive a high tf-idf weight.

So, using the tf-idf measure, a linear combination of word vectors is built and divided by the number of vectors. As a result, we again get a fixed-length vector that represents some average meaning of the news item.

For the example from “Simple Averaging”, the transformation looks like formula (6):

$$ \left. \begin{array}{l} \overrightarrow{\left( 0.345,\ 0.101,\ \ldots,\ 0.640 \right)} \cdot 0.76 \\ \overrightarrow{\left( 0.783,\ 0.089,\ \ldots,\ 0.554 \right)} \cdot 0.34 \\ \cdots \\ \overrightarrow{\left( 0.421,\ 0.484,\ \ldots,\ 0.054 \right)} \cdot 0.56 \\ \overrightarrow{\left( 0.383,\ 0.211,\ \ldots,\ 0.954 \right)} \cdot 0.48 \end{array} \right\} \to \overrightarrow{\omega_{average}} = \overrightarrow{\left( 0.385,\ 0.094,\ \ldots,\ 0.670 \right)} $$
(6)
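The weighted variant can be sketched as follows; the TfidfVectorizer is fitted on the whole news corpus, and the function and variable names are ours.

```python
# tf-idf weighted averaging: each word vector is multiplied by its tf-idf
# weight, and the weighted sum is divided by the number of words (formulas (3)-(5)).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_tfidf(corpus):
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)                      # corpus = list of news texts
    return vectorizer

def news_vector_tfidf(text, model, tfidf, dim=300):
    weights = dict(zip(tfidf.get_feature_names_out(),
                       tfidf.transform([text]).toarray()[0]))
    pairs = [(model[w], weights.get(w, 0.0))
             for w in text.lower().split() if w in model]
    if not pairs:
        return np.zeros(dim)
    return np.sum([vec * wgt for vec, wgt in pairs], axis=0) / len(pairs)
```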

Key Words

Certainly, when calculating the average meaning, information may be lost, since the resulting meaning is very approximate and many words carry no meaning. Let us try to remove the words that do not carry meaning.

We extract a fixed number of keywords from each news item. Then, to pass several words to the algorithm sequentially, we use an LSTM (Long Short-Term Memory) recurrent neural network, which can store values over both short and long periods of time.

The YAKE algorithm [11], presented in 2018 [12], was chosen to extract keywords. YAKE is an unsupervised automatic keyword extraction method that relies on the statistical characteristics of the text. Moreover, it extracts not only nominal entities in the form of words and noun phrases, but also predicates, adjectives and other parts of speech that carry the key information of the news. In [12], the method is compared with ten modern unsupervised approaches (tf-idf, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank) and one supervised method (KEA). According to the authors, the YAKE method shows the best result.

In each news item, the YAKE algorithm marks all keywords in every sentence. All unmarked words are then deleted, so only significant words remain. After that, using Word2Vec or GloVe, the remaining words are translated into the corresponding vectors, and the news item, as a sequence of vectors, is fed into the LSTM network, which outputs a growth/fall marker (1/0). Figure 3 depicts a network model with inner LSTM layers and a convolutional layer.

Fig. 3. Model with inner layers.

Let’s give an example of processing news using keywords:

“Shanghai stocks opened lower and the yuan was weaker against the dollar on Wednesday. Other Asian markets also declined. Hong Kong’s benchmark declined 2% as the city faced heightened tensions”.

Therefore, we get:

“Shanghai stocks opened lower yuan weaker dollar Wednesday. Asian markets declined. Hong Kong’s benchmark declined city faced heightened tensions”.
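The filtering step above can be approximately reproduced with the open-source yake package (pip install yake); the parameter values in the sketch are illustrative, not the exact settings of this work, and in recent yake versions extract_keywords returns (keyword, score) pairs.

```python
# Sketch: keep only the words that YAKE marks as keywords, in original order.
import yake

text = ("Shanghai stocks opened lower and the yuan was weaker against the "
        "dollar on Wednesday. Other Asian markets also declined.")

extractor = yake.KeywordExtractor(lan="en", n=1, top=10)   # single-word keywords
keywords = {kw.lower() for kw, score in extractor.extract_keywords(text)}

filtered = " ".join(w for w in text.split()
                    if w.strip(".,").lower() in keywords)
print(filtered)
```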

3 Experiments

3.1 Data

The experiments were conducted on a set of news collected from BBC News, Breitbart News, CNN, The New York Times, Reuters, Washington Post, Bloomberg and Yahoo News for the period from November 2018 to August 2019 (Table 2). In addition, price data for the S&P 500 Index were taken for the same period. We use the closing price as the price.

Table 2. Time intervals and amount of news.

We use the news content published over the course of an hour to predict the S&P 500 movement up or down, comparing the closing price at t + 1 with the closing price at t, where t is the trading hour.
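A sketch of the label construction, assuming `prices` is a pandas Series of hourly S&P 500 closing prices indexed by trading hour, is given below.

```python
# Label = 1 if the closing price at hour t+1 is higher than at hour t, else 0.
import pandas as pd

def make_labels(prices: pd.Series) -> pd.Series:
    return (prices.shift(-1) > prices).astype(int)
```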

3.2 Implementation Details

Pre-trained word-vector dictionaries exist for both the Word2Vec and GloVe algorithms. For Word2Vec, there are word vectors of dimension 300 pre-trained on Google News; for GloVe, there is a dictionary of dimension 300 trained on various textual data, including news from Common Crawl. Based on the pre-trained word vectors, we also train our own word vectors and, for comparison of results, take the same dimension of 300.

To evaluate news processing approaches, we will use classifiers as machine learning algorithms:

  • Extra trees

  • Support Vector Classifier (SVC) with RBF kernel

  • Random Forest

  • Logistic regression

  • Linear SVC

  • Naive Bayes

  • Multilayer perceptron.

In addition, for the keywords method, we use the LSTM recurrent neural network.

As the evaluation metric, we use the accuracy with which the predicted direction of movement (growth or fall) coincides with the true one.
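A sketch of this evaluation loop with scikit-learn is shown below; hyperparameters are left at their defaults here, whereas in the experiments they are tuned with cross-validation (see Sect. 3.3).

```python
# Train every classifier listed above on the fixed-length news vectors X with
# up/down labels y and report cross-validated accuracy.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "Extra trees": ExtraTreesClassifier(),
    "SVC (RBF)": SVC(kernel="rbf"),
    "Random forest": RandomForestClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(),
    "Naive Bayes": GaussianNB(),
    "Multilayer perceptron": MLPClassifier(max_iter=500),
}

def evaluate(X, y):
    for name, clf in classifiers.items():
        accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(f"{name}: {accuracy:.4f}")
```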

3.3 Prediction Algorithms

Both Word2Vec and GloVe are used in two sets of experiments: with our own trained word vectors and with pre-trained word vectors. The Word2Vec and GloVe methods are used in all experiments.

Next, the Averaging and tf-idf methods form a fixed vector. This vector is fed into classifiers and one standard deep learning model:

  • Extra trees

  • SVC with RBF kernel

  • Random Forest

  • Logistic regression

  • Linear SVC

  • Naive Bayes

  • Multilayer perceptron.

But for the keywords method, only the LSTM recurrent neural network is suitable, because the classifiers and the standard deep learning model listed above lack the ability of an LSTM to accept a sequence of feature vectors of the same type as input and extract new features from them. For each model, the best hyperparameters are selected and cross-validation is carried out.
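A minimal Keras sketch of such a network (cf. Fig. 3), with a convolutional layer followed by LSTM layers, is given below; the layer sizes and the fixed sequence length are our assumptions, not the exact configuration used in the experiments.

```python
# Sequences of keyword vectors -> Conv1D -> LSTM layers -> sigmoid output
# (1 = growth, 0 = fall).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

MAX_KEYWORDS = 1900      # optimal number of keywords reported in Sect. 3.4
EMBED_DIM = 300          # dimension of the word vectors

model = Sequential([
    Conv1D(64, kernel_size=5, activation="relu",
           input_shape=(MAX_KEYWORDS, EMBED_DIM)),
    MaxPooling1D(pool_size=2),
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```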

3.4 Results

The results of comparing the news preparation approaches across machine learning algorithms in Tables 3 and 4 show that pre-trained vectors give a more accurate vector representation of words. This is probably because the amount of news is not sufficient to learn the exact context of words in a vector representation. A comparison of simple averaging with weighted averaging shows that simple averaging conveys the average meaning of a news item worse than weighted averaging with the tf-idf measure. The Word2Vec and GloVe vector representation algorithms perform similarly, but GloVe ultimately gives a 0.5% better result.

Table 3. Prediction results with Word2Vec preprocessing.
Table 4. Prediction results with GloVe preprocessing.

During the experiments with the keyword-based method, the optimal number of keywords was found to be 1900. The experimental data are presented in Fig. 4. It is worth noting that a news item contains 2700 words on average, so the selected number is approximately 2/3 of all words. As a result, the accuracy slightly exceeded the maximum value achieved by the averaging methods.

Fig. 4. Optimal number of keywords.

Further, a series of experiments was carried out with the dimension of the vectors when training our own vectors, since the maximum dimension of the pre-trained vectors offered by the authors of the algorithms does not exceed 300. Figures 5 and 6 show how the prediction accuracy changes with the dimension of the word vectors.

Fig. 5. Prediction changes in Word2Vec.

Fig. 6. Prediction changes in GloVe.

It can be noted that in both cases, once the dimension reaches 1900, the prediction accuracy stops growing and subsequently even worsens. As a result, increasing the dimension of the vectors made it possible to improve the accuracy by 0.5–1%.

4 Conclusion

In this paper, we consider news processing methods for feeding news texts into machine learning algorithms. Algorithms based on vector averaging are simple to apply, but they can lose information, such as the word order in the news.

The algorithm based on keyword extraction, that is, the combination of the most important words with an LSTM, showed the best result (57.63% with Word2Vec and 58.97% with GloVe) and is promising for such tasks.

To improve the result, a deeper analysis of the texts is necessary, including determining whether the meaning of the news is positive or negative. We also believe that using the new Hierarchical Attention Network model can improve news prediction.