1 Introduction

News agencies, organizations, social networks, enterprises, etc. generate a huge amount of information and news in real time. The study of the characteristics and features of the news flow has become one of the recent topics in social informatics. The news stream is publicly available and reflects processes which are going on in the world. Therefore, the examination of structural properties of the news flow and the development of the methods and algorithms for processing such data are of great interest.

In this paper we restrict our analysis to economic and financial news. News analytics providers (such as Media Sentiment, Thomson Reuters and RavenPack) collect, aggregate and pre-process news from different sources, including news agencies and social media (blogs, social networks, etc.). They also use so-called pre-news, i.e. SEC reports, court documents, reports of various government agencies, business resources, company reports, announcements, and industrial and macroeconomic statistics. A survey of applications of news analytics tools can be found in the books [14, 15]. In recent years news analytics tools have also been developed and used in social network analysis [4, 9, 12, 18].

Thus, the huge amount of news analytics data makes it possible to study a number of research problems. One such problem is the analysis of the company co-mention network, which has been addressed in [20, 23]. This paper is concerned with another such problem, namely event detection. Note that news analytics data offer advantages for event detection. Using the real-time nature of news analytics services, both traders and governments may become aware of real world events much more quickly than with offline analysis based on traditional media sources.

Our research interest is close to the problem of event detection in Twitter. Event detection in Twitter has long been a research topic [24], and the main problem has been to detect the events that Twitter users are discussing most. Most studies assumed that an increase in the usage of some related words indicates that an event is happening. An event is therefore commonly characterized by a number of keywords displaying a sharp burst in appearance count [10, 24]. Since then the problem of event detection in Twitter has attracted considerable interest from data mining researchers [1, 2, 5, 6, 17]. The common understanding is that events in social media platforms can be loosely defined as real world happenings that occur within similar time periods and geographical locations, and that have been mentioned by online users. Recent works include [3, 7, 8, 11, 16], among many others.

In this paper, we use an approach based on the ideas of the paper [24]. We use news analytics data to detect events that are of similar scale, localized in time and related to a particular company. Thus, we exploit two characteristics of news items: their localization in time and the similarity of their attributes.

2 Events

In this paper we define events as real world happenings that are reflected by news items concentrated in time and connected with a particular company, a group of companies, or an economic sector. We suppose that an important off-line event attracts the interest of news and information agencies as well as other news providers and news exchange participants. The more important the event, the more intense the corresponding news flow. We therefore assume that an important event causes a cascade of news. We define a cascade of news as a sequence of news items that are published by various news providers within the same time frame, relate to one company and share one topic. As a rule, such a cascade has an initial news message (trigger) that starts it. If a news item turns out to be important, other news providers and news agencies reprint it in a form close to the original. In addition, they can publish news items on subsequent developments caused by this news. If a published news item does not seem important, it may be reprinted by very few news providers, or not at all. News that did not provoke interest and did not lead to a cascade is, in our opinion, hardly connected with an important event.

Therefore, detecting such cascades in the flow of news items allows us to detect events. Each such cascade is a connected component of the graph of news reports. Thus, the identification of cascades and their corresponding events reduces to the task of finding the connected subgraphs (clusters) of the news graph. It can be assumed that the more news items a cascade contains, the more significant the corresponding event is.

Of course, different events can exhibit different temporal scales, i.e. they can span different intervals in time.

In this paper we consider a news analytics data stream with temporal and text information. The goal of this paper is to employ an approach to identify events using the stream data of news analytics providers. We treat event detection as a graph-based clustering problem, where the vertices of the graph represent the news, and the edges reflect their similarities.

3 Event Detection via Temporal and Attributive Constraints

Economic and financial events presented in news flow may have different localization behavior in time and can have different attribution to a particular company (group of companies, sector, country). When the events are localized in the two dimensions (temporal and attributive), event detection can be adequately implemented by imposing temporal and attributive constraints on the data. In this section, we detail an approach for detecting economic and financial events that are localized both in time and attribution. It can be reduced to a clustering problem, where we assemble together the news items that describe the same real world event. We define the similarity measure between different news items \(n_i\) and \(n_j\) as follows:

$$\begin{aligned} S(n_i,n_j) = {\left\{ \begin{array}{ll} 1,\,\, \text {if}\ t(n_i,n_j)\le T_t\, \text {and}\, A(n_i)=A(n_j),\\ 0,\, \text {otherwise}. \end{array}\right. } \end{aligned}$$
(1)

where \(t(n_i,n_j)\) is the absolute difference in minutes between the publication times of \(n_i\) and \(n_j\), and \(A(n)\) is the value of attribute A (company name and news topic) for news item n.

The threshold \(T_t\) enforces a strict temporal constraint on the events, and A enforces the shared attributes. Two news items \(n_i\) and \(n_j\) that satisfy both constraints are taken to refer to the same economic or financial event.
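As an illustration, the measure (1) can be sketched in Python as follows. The record layout `NewsItem`, its field names, and the sample companies and topics are assumptions made for this sketch; the paper does not specify a data format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NewsItem:
    # Hypothetical record layout: the field names are assumptions.
    published: datetime
    company: str
    topic: str


def similarity(a: NewsItem, b: NewsItem, threshold_minutes: float) -> int:
    """Return S(n_i, n_j) from (1): 1 if close in time with equal attributes, else 0."""
    gap_minutes = abs((a.published - b.published).total_seconds()) / 60.0
    same_attribute = (a.company, a.topic) == (b.company, b.topic)
    return 1 if gap_minutes <= threshold_minutes and same_attribute else 0


# Made-up items for illustration only.
a = NewsItem(datetime(2015, 1, 28, 2, 36), "ACME", "earnings")
b = NewsItem(datetime(2015, 1, 28, 4, 10), "ACME", "earnings")    # 94 minutes after a
c = NewsItem(datetime(2015, 1, 28, 4, 15), "Globex", "merger")

print(similarity(a, b, 180))  # same attributes, 94 min <= 180 min
print(similarity(a, c, 180))  # different company and topic
print(similarity(a, b, 60))   # same attributes, but 94 min > 60 min
```

The measure is binary by construction, so lowering the threshold can only remove similarities, never add them.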

The connected components of the graph are found by depth-first search. The main drawback of this clustering algorithm on big graphs is its high demand for RAM. However, this problem is greatly reduced by the pronounced sparsity of our graph.
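A minimal sketch of finding connected components by depth-first search is given below. It is an illustration, not the implementation used in the experiments: the adjacency-list input and the explicit stack (instead of recursion) are choices made for this sketch, and the toy graph is made up.

```python
def connected_components(adjacency):
    """Return the connected components of an undirected graph.

    `adjacency` maps each vertex to an iterable of its neighbours.
    Only existing edges are stored, so memory grows with the number
    of edges rather than with the square of the number of vertices.
    """
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]          # explicit stack: iterative depth-first search
        component = []
        while stack:
            vertex = stack.pop()
            component.append(vertex)
            for neighbour in adjacency[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Toy graph: a path 0-1-2, an edge 3-4 and an isolated vertex 5.
toy_graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
components = connected_components(toy_graph)
print(components)
```

An isolated vertex forms a component of size one, which corresponds to a news item that was not reprinted.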

Different approaches to defining a similarity measure for tweets have been used in [6, 13].

Let \(S(n_i,n_j)\) be the pairwise similarity between news items defined in (1). Let us define matrix W as follows:

$$\begin{aligned} W(n_i,n_j) = {\left\{ \begin{array}{ll} S(n_i,n_j),\, \text {if}\, i\ne j,\\ 0,\, \text {if}\ i=j. \end{array}\right. } \end{aligned}$$
(2)

Matrix W is a symmetric sparse matrix. We can construct a sparse undirected graph \(G = (V, E, W)\) for which W is the adjacency matrix. The vertices V of the graph represent news items, and \(E \subseteq V\times V\) is the edge set. By (1), an edge connects two news items if they are about the same company, have the same topic and are published within \(T_t\) of each other. If \(W(n_i,n_j)\ne 0\) then there is an edge between the vertices \(v_i\) and \(v_j\), with weight \(w_{ij} = 1\).
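Because W is sparse and binary, it need not be stored as a dense matrix. The sketch below keeps only the nonzero entries, as a set of neighbours per vertex. The representation of a news item as a `(minute, attribute)` pair and the sample values are assumptions of this sketch, and the quadratic pairwise loop is used only for clarity.

```python
from itertools import combinations


def build_adjacency(items, threshold_minutes):
    """Sparse form of W from (2): neighbours[i] holds every j with W(n_i, n_j) = 1."""
    neighbours = {i: set() for i in range(len(items))}
    # combinations yields only pairs with i < j, so the diagonal stays zero.
    for i, j in combinations(range(len(items)), 2):
        (t_i, attr_i), (t_j, attr_j) = items[i], items[j]
        if abs(t_i - t_j) <= threshold_minutes and attr_i == attr_j:
            neighbours[i].add(j)   # store both directions: W is symmetric
            neighbours[j].add(i)
    return neighbours


# Made-up items: (publication time in minutes, "company/topic" label).
items = [
    (0, "ACME/earnings"),
    (40, "ACME/earnings"),
    (50, "Globex/merger"),
    (400, "ACME/earnings"),
]
W = build_adjacency(items, threshold_minutes=60)
print(W)
```

Here only items 0 and 1 are linked: item 2 has a different attribute, and item 3 is too far away in time.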

With such a graph representation of W, event detection can then be reduced to a graph partitioning task, i.e. to dividing the graph G into subgraphs. Our goal is to subdivide the vertices of the graph into disjoint clusters in such a way that each cluster contains news items that probably correspond to the same economic or financial event. The constraints in (1) imply that the events should be localized in time and be attributed to the same company (or group of companies, or to the same economic sector or country).
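One observation that follows from (1), offered here as a sketch rather than as a description of the authors' algorithm: since an edge requires equal attributes and a time gap of at most \(T_t\), the connected components inside each attribute group are exactly the maximal runs of time-sorted items whose consecutive gaps do not exceed \(T_t\). The clusters can therefore be obtained by sorting, without building the graph explicitly. The `(minute, attribute)` representation and the sample data are assumptions.

```python
from collections import defaultdict


def detect_cascades(items, threshold_minutes):
    """Group (minute, attribute) pairs into cascades, i.e. connected components of G."""
    by_attribute = defaultdict(list)
    for minute, attribute in items:
        by_attribute[attribute].append(minute)

    cascades = []
    for attribute, minutes in by_attribute.items():
        minutes.sort()
        current = [minutes[0]]
        for previous, minute in zip(minutes, minutes[1:]):
            if minute - previous <= threshold_minutes:
                current.append(minute)       # still inside the same cascade
            else:
                cascades.append((attribute, current))
                current = [minute]           # a gap larger than the threshold starts a new cascade
        cascades.append((attribute, current))
    return cascades


# Made-up items: (publication time in minutes, "company/topic" label).
items = [
    (0, "ACME/earnings"), (40, "ACME/earnings"), (90, "ACME/earnings"),
    (50, "Globex/merger"), (400, "ACME/earnings"),
]
for attribute, minutes in detect_cascades(items, threshold_minutes=60):
    print(attribute, minutes)
```

Note that the items at minutes 0 and 90 are not directly similar (their gap exceeds 60), yet they fall into the same cascade because the item at minute 40 links them. This chaining is what distinguishes connected components from pairwise similarity.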

figure a: Algorithm 1

4 Empirical Result

We now test the performance of Algorithm 1 for real world event detection problems. We first describe the data and some implementation details, and then present the event detection results.

4.1 Data Description

Providers of news analytics obtain and aggregate data from different sources (including news agencies and business reports) and social media. Our data cover the period from January 1, 2015 to January 31, 2015 (i.e. 22 trading days). We consider all the news released during this period. Initially we performed a data selection and cleaning process.

Descriptive statistics of the news flow can be found in Table 1. The total number of news items issued in January was 134199. Table 1 shows that the average number of news items published per day was 4329 (134199 items over 31 calendar days). There were days with very low news flow intensity (e.g. only 30 economic and financial news items were published on January 1, 2015), while other days exhibit a much higher number of published news items (e.g. 10238 news items were issued on January 29, 2015).

Table 1. Descriptive statistics of news flow in January, 2015

Figure 1 shows that the average number of news items on holidays is much lower than on working days. Therefore, we eliminate all holiday news from the study. The news flow is highly noisy, and time series of news flow data exhibit self-similarity and fractality [19, 21, 22]. To handle noisy information in Twitter data, the paper [6] uses a homogeneous Poisson process as a statistical model.

Fig. 1. Daily dynamics of the news flow in January, 2015

4.2 Event Detection Results

We would like to group similar news items into the same cluster in such a way that they could correspond to a real world event. We implement the event detection algorithm and obtain empirical results for different values of the temporal threshold \(\theta = 0.5, 1, 3, 6, 12\) h (the threshold denoted \(T_t\) in (1)) for the detection of event clusters. The results are presented in Tables 2 and 3. Table 2 shows how the number of cascades depends on the cascade size for different values of \(\theta \). Table 3 presents the sizes of connected components and the temporal durations of the corresponding news cascades for different values of \(\theta \).
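The effect of the threshold can be illustrated with a small sweep over the same values of \(\theta \). The sketch below handles a single company/topic group, uses made-up publication times (not the data of this study), and relies on the fact that, within one attribute group, cascades are the runs of time-sorted items with consecutive gaps of at most \(\theta \). The function name `cascade_sizes` is hypothetical.

```python
from collections import Counter

THETAS_HOURS = [0.5, 1, 3, 6, 12]   # threshold values considered in the text


def cascade_sizes(hours, theta):
    """Sizes of the cascades formed by one company/topic group for threshold theta."""
    if not hours:
        return []
    hours = sorted(hours)
    sizes, size = [], 1
    for previous, current in zip(hours, hours[1:]):
        if current - previous <= theta:
            size += 1
        else:
            sizes.append(size)
            size = 1
    sizes.append(size)
    return sizes


# Made-up publication times in hours from an arbitrary origin.
toy_hours = [0.0, 0.2, 0.9, 3.5, 3.8, 9.0, 20.0, 20.4]
for theta in THETAS_HOURS:
    sizes = cascade_sizes(toy_hours, theta)
    # number of cascades and histogram {cascade size: count}
    print(theta, len(sizes), dict(Counter(sizes)))
```

On this toy input the number of cascades falls from five at \(\theta = 0.5\) h to a single cascade at \(\theta = 12\) h, which mirrors the merging effect of large thresholds.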

As can be seen from Table 2, the distribution of the sizes of connected components is fairly stable regardless of the parameter \(\theta \). At the same time, Table 3 shows that most of the connected components correspond to small cascades (from 2 to 10 news items). Such small cascades cannot reflect an event. Clearly, increasing \(\theta \) leads to larger connected components, and the corresponding cascades may eventually span the entire period under consideration. In our opinion, taking \(\theta \) greater than 12 h could merge the whole news flow of a fairly large company into a single cascade. We also expect that, if an event has occurred, it will be discussed intensively in the news flow. Moreover, because the world financial markets operate around the clock, the event will not be forgotten: news will be published successively as the Asian, European and American markets are open. Therefore, we consider it sufficient to take \(\theta \) between 3 and 6 h in order to detect events.

Table 2. The number of cascades N and the average duration of cascades T with a given size (the number of news items in a cascade), for different values of \(\theta \)
Table 3. Characteristics of the sizes of connected components and the temporal durations of the corresponding news cascades for different values of \(\theta \)

The analysis of the clustering results for Algorithm 1 shows that the clusters detected by the algorithm correspond to meaningful real world economic and financial events of interest.

Below we describe a few cases of news cascades found by Algorithm 1.

  1. Yahoo Inc. decided to transfer its 15% stake in the Chinese Alibaba Group to a separate company. At the close of trading on January 26th the stake was valued at $40 billion, according to a message on the company’s website. The cascade began at 28.01.2015 02:36:07 and lasted until 28.01.2015 18:04:04. Duration 15 h 27 min 57 s. The number of news items in the cascade is 22. The first news item was published by Forbes.

  2. Facebook reported its fourth quarter and full year 2014 results. Its user community was continuously growing. Revenue for the fourth quarter of 2014 totaled $3.85 billion, an increase of 49% compared with $2.59 billion in the fourth quarter of 2013. The cascade began at 21.01.2015 21:01:12 and lasted until 22.01.2015 15:17:24. Duration 18 h 16 min 12 s. The number of news items in the cascade is 48. The first news item was published by Facebook.

  3. Netflix ended 2014 with 57.4 million subscribers worldwide, topping expectations for overseas growth. The No. 1 subscription video-on-demand provider reported quarterly revenue of $1.48 billion, up 26% year over year. Netflix shares were up more than 13% in after-hours trading, to $395.15 per share, on the results. The cascade began at 20.01.2015 21:01:11 and lasted until 21.01.2015 15:21:30. Duration 18 h 20 min 29 s. The number of news items in the cascade is 52. The first news item was published by Variety.

  4. On the morning of January 28, 2015, at 05:19:31, the BBC news service reported Apple’s record income. Apple reported a quarterly profit of 18 billion dollars, a historical record for a public company. After that, the price of Apple shares increased by 5%. This report triggered a news cascade with a duration of 12 h 14 min 38 s. The cascade contained 25 news items.

  5. Tesco PLC is a British company and one of the oldest transnational retail chains. Tesco PLC owns a large retail network in Britain and controls about 30% of food retail. On the morning of January 8, 2015, at 7:00:28, The Wall Street Journal, one of the largest American publications, published news about lower prices and the closure of unprofitable stores. Tesco PLC planned to close about 43 unprofitable stores and to sell a range of assets. After that, Tesco’s shares immediately rose by 14.97% and reached 2.09 pounds. This information triggered a news cascade with a duration of 21 h 31 min. The cascade contained 66 news items (Fig. 2).
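The cascade durations reported above follow directly from the first and last timestamps. A small sketch of this arithmetic is shown below, using the day.month.year timestamp layout of the listed cases; the function name `cascade_duration` is hypothetical.

```python
from datetime import datetime

FORMAT = "%d.%m.%Y %H:%M:%S"   # day.month.year, as in the timestamps above


def cascade_duration(first, last):
    """Return (hours, minutes, seconds) between two timestamp strings."""
    delta = datetime.strptime(last, FORMAT) - datetime.strptime(first, FORMAT)
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


# Cases 1 and 2 from the list above.
print(cascade_duration("28.01.2015 02:36:07", "28.01.2015 18:04:04"))   # (15, 27, 57)
print(cascade_duration("21.01.2015 21:01:12", "22.01.2015 15:17:24"))   # (18, 16, 12)
```

Hours are not capped at 24, so a cascade lasting longer than a day is still reported as a single hour count.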

Fig. 2. The histogram for the number of connected components (with 11 and more items), threshold \(\theta = 3\) h

5 Conclusion

In this paper we suppose that an important economic or financial event should lead to a cascade of news. For this reason, the mining of such cascades in the news flow may be exploited to detect events. Each such cascade forms a cluster in the graph of news reports. Therefore, the mining of cascades (and associated events) can be reduced to the problem of finding the corresponding clusters in the graph. The clustering algorithm employs a similarity measure between news items that uses both the temporal and the textual features of a news item. It is supposed that two news items that are generated by news agencies and describe the same event should be closely connected with a particular company, a group of companies, an economic sector, or a country. To measure the similarity between news items, we use an approach based on temporal and affiliation constraints. Empirical results on real world data collected from news analytics providers demonstrate the capability and efficiency of the event detection approach.