1 Introduction

One example of big data that has emerged relatively recently is the information flow generated by news agencies, enterprises, organizations, social networks, etc. Such real-time information flows are characterized by a lack of structure, a large number of both news sources and news subjects, and high frequency (thousands of news items per second). Therefore, one of the problems facing information systems that process such data is to aggregate them into one or several indicators that describe the intensity, stability and changing structure of the news flow, identify the most discussed news subjects, and so on. Note that the characteristics and features of the news flow have not yet been sufficiently studied, and the methods and algorithms for processing such data are not fully developed.

In this paper we use methods of graph theory and the corresponding algorithms to examine news flow data. A social network analysis (SNA) approach makes it possible to investigate (explain) structures in systems based on the relations among the system’s components (nodes) rather than the attributes of individual cases [6]. Methods of social network analysis can be used to analyze the structure of relations in an organization [7, 8] or to examine relationships among social entities [44]. The basic concepts of SNA are the node and the link, where a node refers to a unit (individual, object, item) and a link indicates a relationship between nodes.

News flow data can be easily converted into network data: a company can be represented as a node, and the mention of two companies in one news item can be represented as a link between them. Using this co-mention network, we can study basic network properties such as centrality and tie strength. Highly mentioned companies may be treated as key companies that are more significant to the economy than other companies; in co-mention analysis, these are the central nodes of the network. Moreover, we can treat the number of co-mentions as the link weight, so that frequently co-mentioned pairs of companies have links with higher weights.

The paper [35] used various social network analysis metrics and citation indices to find key companies in the network. The focus of the present paper is the analysis of different parts of the co-mention network (subnetworks) rather than the co-mention network as a whole. In our study, we construct subnetworks in two ways:

  • subnetworks represent different sectors of economy, such as products, resources, services, IT sectors.

  • each subnetwork consists of companies with the same stock exchange affiliation.

The concept of link weight is crucial to our analysis. By examining the frequencies and other SNA metrics of each company within a subnetwork, co-mention analysis can identify the key companies of that subnetwork. For co-mentioned companies, link weights may help to identify the most frequently co-mentioned pairs of companies, as well as those that are rarely co-mentioned but are useful in providing diversity to the sector or subnetwork to which they belong. We suppose that companies with weak links to other companies may play a valuable role in expanding the diversity of economic information within their subnetwork or clique.

In our research we would like to find answers to the following questions:

  • Do the frequency, degree centrality, closeness centrality, betweenness centrality and eigenvector centrality of companies vary within subnetworks (clusters)? Within each group (or subnetwork) of companies, we aim to find those that are more central than others. It is assumed that the more central a company is within a network, the more influential and important the news about it is.

  • Does the type of degree distribution vary within clusters? What functional form does the clustering-degree relation take for the clusters?

  • Does the analysis of the company co-mention network identify groups? Our hypothesis is that the network analysis of the company co-mention network reproduces the sector structure of the economy. The company co-mention network may be very sparse, but the graph of companies is expected to show which of them are often mentioned together or which belong to the same sector of the economy. Each group or cluster of companies might be associated with a particular sector, for example, products, resources, services or information technology. Another hypothesis we would like to examine is that the structure of the company co-mention network reflects the clustering of companies by their stock exchange affiliation.

To answer the first and second questions we use well-known SNA metrics and methods. The answer to the third question relies on the quadratic assignment procedure (QAP), which was proposed and developed in [13, 18, 21, 27]. Since then, QAP has been widely used in social network analysis (see e.g. [5, 9, 10, 17, 34, 42], among many others). QAP is a particular type of permutation test that preserves the dyadic structure of the data under permutation.

Our research questions are close to those of the paper [20], which investigated whether the patterns of author co-citation can describe the structure of the field of communication.

Our study uses data delivered by news analytics providers. In our opinion, the data are quite typical and can be used for processing and analysis. Using these data, the papers [36, 37, 38, 39] examined different news flow characteristics, such as intensity, stability, volatility, long-term memory, fractality, etc.

2 Data

A huge number of economic and financial news items are generated in real time by news agencies, stock exchanges, companies, magazines, papers, blogs and so on. Different companies are named in these news items. We carry out the company co-mention analysis in five steps:

  1. we assemble all economic and financial news items produced during one month of 2015 (February 2015);

  2. we collect a list of mentioned companies for each news item;

  3. we calculate a weighted co-mention count for each pair of co-mentioned companies based on the whole set of available news items;

  4. we produce a symmetric co-mention matrix using these weighted co-mention counts;

  5. we analyze this co-mention matrix statistically, and the results are visualized and interpreted.

Steps 1 and 2 are carried out by the providers of news analytics. In our research, we work with these preprocessed data.
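Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study: the weighting scheme is not specified in the text, so the sketch assumes that the weight of a pair is simply the number of news items mentioning both companies. The function names and the toy input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_mention_counts(news_items):
    """Count how often each unordered pair of companies shares a news item."""
    counts = Counter()
    for companies in news_items:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(companies)), 2):
            counts[(a, b)] += 1
    return counts


def to_symmetric_matrix(counts):
    """Turn pair counts into a symmetric matrix with a zero diagonal."""
    names = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for (a, b), weight in counts.items():
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight
    return names, matrix


# Hypothetical toy input: each inner list holds the companies named in one news item.
news = [["A", "B"], ["A", "B", "C"], ["C"], ["B", "C"]]
names, matrix = to_symmetric_matrix(co_mention_counts(news))
```

A news item that names a single company, such as the third item in the toy input, contributes no pairs and therefore no links.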

2.1 News Analytics Data

Two of the biggest providers of news analytics are Thomson Reuters and RavenPack. They gather news items in real time from diverse sources, including news agencies and social media (blogs, social networks, etc.). They also use so-called pre-news, i.e. SEC reports, court documents, reports of various government agencies, business resources, company reports, announcements, and industrial and macroeconomic statistics. The providers then perform a preliminary analysis of each news item in real time. Using AI algorithms, they calculate news-related expectations (sentiments) based on the current market situation. As a rule, providers of news analytics deliver the following attributes for each news item to subscribers in real time: time stamp, company name, company id, relevance of the news, event category, event sentiment, novelty of the news, novelty id, and composite sentiment score of the news, among others. Subscribers may develop and exploit quantitative models or trading strategies based on both the news analytics data and financial time series data. A survey of applications of news analytics tools can be found in the books [28, 29].

2.2 Generating Company Co-mention Network

Methodology. The company co-mention network is formed on the basis of co-mentions: a company is connected with those companies that have been mentioned together with it in a news item. In other words, two companies are linked if a publicly available news item mentioning both of them has been published. In this type of network, a company is represented by a “node” (or “vertex”) and a connection by an “edge”. Thus, we represent the company co-mention network as an undirected weighted graph. In some sense, the company co-mention network can be considered a social network. Based on the available news analytics data, we built an adjacency matrix that represents the relationships between companies, in line with the approach described in [35].

Network. We deal with all the financial and economic news items released during the one-month period from February 1, 2015 to February 28, 2015 (i.e. 20 trading days). We eliminated all news on the imbalance of supply and demand issued before the opening and the closing of trading on different stock exchanges. News of this type may amount to several hundred items released within a short time at the beginning and at the end of trading sessions. During February 2015, more than 230 thousand news items were published, mentioning more than 18,000 companies. We assembled the list of all companies that share at least one news item with at least one other company. There were more than 7,000 such companies in February 2015.

Table 1 presents the descriptive statistics of time series.

Table 1. The descriptive statistics, one day

After obtaining the undirected symmetric matrix of weighted co-mention counts for each pair of companies, we use R packages to compute basic statistics and to visualize the network of companies.

2.3 Social Network Analysis

Social network analysis (SNA) describes social relations in terms of graph theory. SNA presents objects (e.g. individuals, groups, organizations, URLs, and other connected entities) within the network as nodes, while links represent relationships (e.g. friendship, co-authorship, organizational and sexual relationships) between the objects [1, 14, 22, 25, 40].

Social networks can be represented in the form of a diagram, where nodes are points and links are lines. SNA deals with measuring relationships between objects [12, 26, 30, 44].

The nodes in our network represent companies and the links represent co-mention relationships between the nodes. SNA provides both visual and mathematical analysis of relationships [33, 41].

Recent years have seen increased interest in the study of Social Media using SNA (see e.g. [3, 11, 23, 45, 47], among many others).

Key objects are those that are in relationships with many other objects. In the context of our analysis, a company with extensive links or co-mention with many other companies in the economy is considered more important than a company with relatively fewer links. Different types of SNA metrics can be used to find key companies in the network. In our analysis, we use the following well-known metrics: degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, frequency. A detailed description of these metrics can be found in the article [25].
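As an illustration of two of these metrics, the following minimal Python sketch computes normalized degree centrality and closeness centrality on an unweighted graph stored as a dictionary of neighbour sets. It is only a sketch under stated assumptions: the study itself relies on R packages, the toy network and function names are hypothetical, and betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque


def degree_centrality(adj):
    """Share of the other nodes that each node is directly linked to."""
    n = len(adj)
    return {v: (len(neigh) / (n - 1) if n > 1 else 0.0) for v, neigh in adj.items()}


def closeness_centrality(adj):
    """Number of reachable nodes divided by the sum of shortest-path lengths (BFS)."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical toy network: hub "A" linked to B, C and D, plus one extra link B-C.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
```

In the toy network the hub "A" has the highest value of both metrics, while the pendant node "D" has the lowest.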

2.4 Key Company Analysis According to Sectors of Economics

In addition to the co-mention matrix, we create a matrix describing which of ten economic sectors each company belongs to (consumer discretionary, consumer staples, energy, financials, health care, industrials, information technology, raw materials, telecommunications services and utilities) and which stock exchange each company is related to.

All the companies we considered were divided into four sectors of the economy in the following way:

$$\begin{aligned} \left. \begin{array}{r} \text {Consumer discretionary} \\ \text {Consumer staples}\\ \text {Industrials} \end{array} \right\}&\longrightarrow {} \text {Products (2007 companies)} \\ \left. \begin{array}{r} \text {Energy } \\ \text {Raw materials} \end{array} \right\}&\longrightarrow {} \text {Resources (1398 companies)} \\ \left. \begin{array}{r} \text {Financials} \\ \text {Health care}\\ \text {Telecommunications}\\ \text {Utilities} \end{array} \right\}&\longrightarrow {} \text {Services (2375 companies)} \\ \left. \begin{array}{l l} \text {Information technology} \end{array} \right\}&\longrightarrow {} \text {IT sector (548 companies) } \end{aligned}$$
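The grouping above can be written down as a simple lookup table. The Python sketch below also shows one possible way to turn group labels into a binary affiliation matrix; this construction (1 if two different companies share a group, 0 otherwise) is an assumption made for illustration, since the text does not specify how the affiliation matrix is encoded. The function name and the toy list of companies are hypothetical.

```python
# Mapping of the ten sectors to the four aggregated groups, as listed above.
SECTOR_TO_GROUP = {
    "Consumer discretionary": "Products",
    "Consumer staples": "Products",
    "Industrials": "Products",
    "Energy": "Resources",
    "Raw materials": "Resources",
    "Financials": "Services",
    "Health care": "Services",
    "Telecommunications": "Services",
    "Utilities": "Services",
    "Information technology": "IT sector",
}

# Company counts per group as reported in the text.
GROUP_SIZES = {"Products": 2007, "Resources": 1398, "Services": 2375, "IT sector": 548}


def same_group_matrix(groups):
    """Binary affiliation matrix: 1 if two different companies share a group."""
    n = len(groups)
    return [[int(i != j and groups[i] == groups[j]) for j in range(n)] for i in range(n)]


# Hypothetical companies labelled only by their sector.
sectors = ["Energy", "Raw materials", "Financials"]
affiliation = same_group_matrix([SECTOR_TO_GROUP[s] for s in sectors])
total_companies = sum(GROUP_SIZES.values())
```

The four group sizes add up to 6,328 companies.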

Table 2 lists the top five companies with the highest frequency of co-mention in each of the four sectors. The table also shows the number of each company’s links and several standardized centrality indicators. As the table shows, General Motors, with its high frequency and high degree, is a key company in the products sector. It strongly dominates the rest of the companies, having the largest number of news co-mentions and links to other companies. Most of the largest companies in this sector belong to automotive groups or aircraft manufacturers. We note that in terms of proximity (closeness centrality), all the companies considered are identical; this is typical for all sectors of the economy.

Table 2 also shows the largest companies in the resources sector. The leading companies in this sector are oil and gas companies, as might be expected. The American companies Apache Corporation and Continental Resources lead in co-mention frequency and number of links. Their high eigenvector centrality indicates that Apache Corporation and Continental Resources interact with other large companies in the sector. These companies also act more often as bridges (connecting links) between the companies of this sector.

In the services sector, there are several leaders, all of which are among the largest financial holdings: JPMorgan Chase, Citigroup and Bank of America. This group of companies also leads in all the centrality measures considered. Apple is the leader in the information technology sector on all indicators by a large margin. Our earlier article [35] showed that Apple is the leading (key) company among all the companies under consideration.

Table 2. Companies with higher frequency for four sectors of the economy

2.5 Key Company Analysis According to Stock Exchanges

Next, we studied the co-mention networks of companies belonging to different territorial zones. We considered separately the companies traded on European, American and Asian stock exchanges. Specifically, we analyzed companies traded on the London SE (488 companies), the New York SE (1,715 companies) and the Tokyo SE (586 companies).

Table 3 shows that the key companies of the London Stock Exchange in terms of the number of links are Barclays Bank PLC and Aviva PLC. The mining company BHP Billiton PLC has the largest number of news co-mentions and interacts with the largest companies of this exchange. The table then lists the key companies of the New York Stock Exchange, all of which belong to the oil and gas industry. The key companies of the Tokyo Stock Exchange, in turn, belong to the engineering and industrial sectors.

Table 3. Companies with higher frequency for three stock exchanges

2.6 Degree Distribution Analysis

In this section, we analyze the degree distributions of the subgraphs for the four sectors of the economy and the three stock exchanges. A description of the procedure can be found in the articles [2, 19].

A degree distribution is said to follow a power law if

$$ n(k)\sim Ak^{-\gamma }, $$

where n(k) is the number of nodes with degree k and \(\gamma \) is the degree exponent.
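One common way to estimate the parameters of such a relation is ordinary least squares on the log-log scale, which also yields a coefficient of determination. The Python sketch below illustrates this approach; it is an assumption that the estimates reported here were obtained in a similar way, and the function name and the toy degree sequence are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Least-squares fit of log n(k) = log A - gamma * log k; returns (A, gamma, R^2)."""
    n_k = Counter(d for d in degrees if d > 0)  # n(k): number of nodes with degree k
    xs = [math.log(k) for k in n_k]
    ys = [math.log(c) for c in n_k.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 1.0
    return math.exp(intercept), -slope, r_squared


# Hypothetical degree sequence built so that n(k) = 64 / k^2 exactly for k = 1, 2, 4, 8.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
A, gamma, r2 = fit_power_law(degrees)
```

For the toy sequence the fit recovers the exponent 2 and the coefficient 64 up to rounding error. The function needs at least two distinct degrees in the input, otherwise the slope is undefined.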

For all stock exchanges under review (Table 5) and economic sectors (Table 4) the distribution of degrees follows the power law. Figure 1 shows the dependence of the number of companies n(k) on node degree k for NYSE.

The resulting models are statistically significant at the 0.01 level. However, the degree exponent for all stock exchanges and sectors of the economy does not fall in the interval (2, 3). Thus, the number of vertices decreases with degree more slowly than in typical social networks [16, 24]. It is also notable that the fewer companies a subgraph contains, the more slowly the number of vertices decreases with degree and the lower the coefficient of determination is.

Table 4. Degree distribution for four sectors of the economy
Table 5. Degree distribution for three stock exchanges
Fig. 1. The degree distribution of the New York companies’ co-mention network

2.7 Clustering Coefficient Distribution

In this section, we analyze the clustering coefficient distributions for the subgraphs of the four sectors of the economy and the three stock exchanges described above. A description of these procedures can be found in the articles [4, 15, 43].

We consider the dependence of the average clustering coefficient C(k) of nodes with degree k on the degree:

$$ C(k)\sim B k^{-\beta }, $$

where the exponent \(\beta \) usually lies between 1 and 2 [31, 32, 46].
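The quantity C(k) can be computed as sketched below in Python: the local clustering coefficient of each node is the share of pairs of its neighbours that are themselves linked, and C(k) averages this value over all nodes of degree k. This is only an illustrative sketch on an unweighted graph; the function names and the toy network are hypothetical, and nodes with fewer than two neighbours are assigned a coefficient of 0 by assumption.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adj):
    """Fraction of neighbour pairs of each node that are themselves linked."""
    coeff = {}
    for v, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            coeff[v] = 0.0  # Assumption: undefined cases are set to zero.
            continue
        links = sum(1 for a, b in combinations(neigh, 2) if b in adj[a])
        coeff[v] = 2.0 * links / (k * (k - 1))
    return coeff


def mean_clustering_by_degree(adj):
    """Average local clustering coefficient C(k) over all nodes of degree k."""
    coeff = local_clustering(adj)
    buckets = defaultdict(list)
    for v, neigh in adj.items():
        buckets[len(neigh)].append(coeff[v])
    return {k: sum(vals) / len(vals) for k, vals in buckets.items()}


# Hypothetical toy network: triangle A-B-C with a pendant node D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
c_of_k = mean_clustering_by_degree(adj)
```

In the toy network the two nodes of degree 2 have coefficient 1, while node "A" of degree 3 has coefficient 1/3, so C(k) decreases with k for k of at least 2.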

For the given networks, the clustering-degree distribution relation follows the power law.

The resulting models are statistically significant at the 0.01 level. However, the exponent \(\beta \) turned out to be less than 1 for all stock exchanges and sectors of the economy (Tables 6 and 7).

It should also be noted that the power-law dependence between the local clustering coefficient and the degree appears only for sufficiently large vertex degrees k; there is no dependence for relatively small k. This is typical for all subgraphs under consideration and can be clearly seen in the example of the New York subgraph (Fig. 2).

Sometimes the flow contains news items that mention a large number of actors (companies). For example, about 0.5% of news reports mention 10 or more companies. In such cases, the procedure we used to fill the co-mention matrix generated a “pleiad”, i.e. a complete subgraph that included all possible edges between the actors mentioned in the report, which caused the vertex degree and clustering coefficient values to deviate from the general pattern. Excluding news items with a large number of co-occurrences from the analysis eliminated this problem, but this approach leads to a significant loss of information. It seems more accurate to use a modified procedure for calculating the clustering coefficient that takes into account the peculiarities of the co-occurrence flow. We plan to do this in the future.
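The size of this effect can be illustrated with a short calculation. A single news item mentioning m companies links every pair of them, so the number of edges it generates is

$$ E(m)=\binom{m}{2}=\frac{m(m-1)}{2}, \qquad E(10)=45. $$

Each of the m companies thereby acquires m-1 neighbours that are all linked to one another, so a company that appears only in this news item has degree m-1 and a local clustering coefficient equal to 1.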

In further research, we will take this feature of the news flow into account, particularly in procedures for filtering and decomposing the news flow. We propose to separate and consider individually repetitive co-mentions (the part of the graph that is stable over time), co-mentions caused by particular political or economic events (event-driven co-mentions), and, finally, random perturbations.

Table 6. Local clustering coefficient for four sectors of the economy
Table 7. Local clustering coefficient for three stock exchanges
Fig. 2. The clustering-degree distribution of the New York companies’ co-mention network

3 QAP Correlation and Regression Analysis

Using the co-mention matrix and the companies’ sector affiliation matrix (as well as the stock exchange affiliation matrix) we conduct QAP Correlation Analysis. QAP (Quadratic Assignment Procedure) was proposed and developed in [13, 18, 21, 27]. We use QAP Correlation Analysis to identify correlations

  • between the co-mention network and companies’ sector affiliation,

  • between the co-mention network and stock exchange affiliation.

With the co-mention network as the primary network, its cells are compared with the corresponding cells of the sector affiliation matrix (as well as the stock exchange affiliation matrix) to compute Pearson’s correlation. We then repeat the computation many times, each time randomly permuting the rows and columns of one matrix with the same permutation, to obtain a reference distribution for the correlation. A higher absolute value of Pearson’s correlation means a stronger relationship between the matrices.
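The procedure can be sketched in Python as follows. This is a minimal illustration rather than the R implementation used in the study: the function names, the toy matrices, the default of 500 permutations with a fixed seed, and the one-sided p-value with the "+1" correction are all assumptions made for the example.

```python
import random


def offdiag(matrix):
    """Flatten all off-diagonal cells in a fixed order."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def qap_correlation(a, b, n_perm=500, seed=0):
    """Observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(a)
    flat_b = offdiag(b)
    observed = pearson(offdiag(a), flat_b)
    at_least = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        # Relabel nodes: the same permutation is applied to rows and columns.
        permuted = [[a[p[i]][p[j]] for j in range(n)] for i in range(n)]
        if pearson(offdiag(permuted), flat_b) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Hypothetical toy data: co-mention weights and a same-sector indicator for 5 companies.
co_mention = [
    [0, 5, 4, 0, 0],
    [5, 0, 6, 0, 1],
    [4, 6, 0, 0, 0],
    [0, 0, 0, 0, 3],
    [0, 1, 0, 3, 0],
]
same_sector = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
r, p_value = qap_correlation(co_mention, same_sector)
```

Because rows and columns are permuted together, each permuted matrix is a relabelling of the same network, so the dependence among cells that share a node is preserved in the reference distribution.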

The first research hypothesis states that the graph of companies resulting from a network analysis identifies sectors. First, a QAP analysis of the co-mention network and the sector affiliation network was carried out to examine whether the sector affiliation network can predict the structure of the co-mention network. We calculate the value of the Pearson correlation using an R package.

The QAP correlation analysis shows a significant correlation between the co-mention network and both the stock exchange affiliation network (\(r=0.053\), \(p=0.000\)) and the sector affiliation network (\(r=0.020\), \(p=0.000\)).

We used the QAP procedure to test the significance of the correlation coefficients. The estimated density of QAP replications for the sector affiliation network is shown in Fig. 3. Similar results were obtained for the co-mention network and the stock exchange affiliation. The observed values of the correlation coefficients were higher than the simulated values in all 500 permuted samples. Thus, the observed correlation coefficients are statistically significant, although they are close to zero. This can be explained by the fact that the adjacency matrices are of large dimension and are very sparse, which could contribute to the underestimation of the correlation coefficient.

We estimated linear regressions of the elements of the co-mention network matrix on the elements of the stock exchange affiliation matrix and on the elements of the sector affiliation matrix. The parameter values found by the least squares method and the p-values obtained by the QAP procedure are given in Table 8.

The QAP regression analysis (Table 8) shows that only \(0.2\%\) of the variance is explained by the model (\(R^2=0.002\)). This value is low and indicates that the model lacks explanatory variables. The coefficients of the model are nevertheless statistically significant.

Table 8. QAP regression analysis
Fig. 3. Estimated density of QAP replications for the sector affiliation network

The structure of the co-mention network of companies is similar to that of their territorial connections. We visualize the co-mention matrix using an R package. We identify clusters using the link weights, which are derived from co-mention frequencies. We use the sector affiliation matrix as attribute data to see whether companies in one cluster have the attribute in common. Figure 4 shows a small part of the network map of companies. Nasdaq companies together with New York Stock Exchange companies, and DAX companies together with London Stock Exchange companies, form two clusters. London companies along with GM (General Motors) are more likely to be a bridge connecting the US with Europe.

Fig. 4. Network map of companies

Big companies generate a much bigger news flow than small companies. For this reason, the co-mention matrix is dense for the largest companies and is much sparser for small companies. However, the overall spatial and sector affiliation of companies significantly affects the probability and frequency of co-mentions. In our opinion, the widening of the range of analyzed companies induces the sparsity of the co-mention matrix, which leads to a drop in the share of the explained variance.

4 Conclusion

In this article, we investigated the relationship between the company co-mention network and the sector affiliation matrix. Moreover, we identified key companies in different sectors of the economy using various indicators of network analysis, such as frequency, normalized degree centrality, closeness centrality, betweenness centrality and eigenvector centrality. We found that different network analysis indicators take different values for different companies, but some companies rank highly on all the indicators considered. The majority of leading (key) companies belong to the New York Stock Exchange. We showed that the degree distribution and the clustering-degree relation for our network follow a power law, although with nonstandard values of the exponents. The QAP analysis showed a significant positive correlation between the company co-mention network and stock exchange affiliation, and between the company co-mention network and the sector affiliation network.