
1 Introduction

A standard approach to annotating documents in a large corpus is to use content-based classifiers, e.g. Support Vector Machines (SVMs), specialised in the detection of specific topics [20]. When the corpus grows over time, these classifiers are applied to all new documents. When new classifications need to be added to the system, new classifiers need to be trained and added to the set. We are interested in the design of highly autonomous systems for the management of corpora, and as part of this effort we have developed the news monitoring infrastructure ‘News Outlets Analysis and Monitoring system’ (NOAM) [14]. As part of adding more autonomy and flexibility to that infrastructure, we have been investigating various ways to propagate annotations across documents in a corpus that do not involve training content-based classifiers. We are further interested in the situation where the annotation of a corpus improves with time, that is, with receiving new labelled data. We want the accuracy of existing labels to improve as more data arrive, with old classification errors being amended where possible, and if entirely new labels start appearing in the data stream, the system should accommodate them automatically and efficiently.

In this paper we explore the use of label propagation algorithms to label graph nodes, so that the knowledge of our system about the corpus is represented both in the topology of the graph and in the labels attached to its nodes. This also requires scalable graph creation methods. The naïve approach to graph construction, comparing all documents to all documents or building a complete kernel matrix [28], will not work in large-scale systems due to its high computational complexity. The cost of label propagation is also an important factor in the time needed to process incoming documents.

We present a method to propagate labels across documents by creating a sparse graph representation of the data, and then propagating labels along the edges of the graph. Much recent research has focused on methods for propagating labels, taking for granted that the graph topology is given in advance [8, 18]. In reality, unless working with web pages, textual corpora rarely have a predefined graph structure. Graph construction alone has a worst-case cost of \(\mathcal{O}({N}^{2})\) when using a naïve method, due to the calculation of the full \(N \times N\) pairwise similarity matrix. Our proposed method can be performed efficiently by using an inverted index, so that the overall method has a time complexity of \(\mathcal{O}(N\log N)\) in the number of documents N.

We test our approach by creating a graph of 800,000 vertices using the Reuters RCV1 corpus [22], and we compare the quality of the label annotations obtained by majority voting against those obtained by using SVMs. We chose to compare the graph-based methods to SVMs because they are considered the state of the art for text categorisation [27]. We show that our approach is competitive with SVMs, and that the combination of our relation-based approach with SVMs leads to an improvement in performance over either method individually. It is also important to note that our methods can easily be distributed across multiple machines.

1.1 Related Work

There is a growing interest in the problem of propagating labels in graph structures. Previous work by Angelova and Weikum [2] extensively studied the propagation of labels in web graphs, including a metric distance between labels, and assigning weights to web links based upon content similarity of the webpage documents. Many alternative label propagation algorithms have also been proposed over the years, with the survey [31] giving an overview of several different approaches cast in a regularisation framework. A common drawback of these approaches is the prohibitively high cost associated with label propagation. A number of recent works on label propagation [8, 9, 18] concentrate on extracting a tree from the graph, using a very small number of the neighbours of each node. While many graph-based methods do not address the problem of the initial graph construction, assuming a fully connected graph is given, or simply choosing to work on data that inherently has a graph structure, there is a large body of work dedicated to calculating the nearest neighbours of a data point. One such approximate method, NN-Descent [13], shows promising results in terms of accuracy and speed for constructing k-Nearest Neighbour graphs, based upon the principle that ‘a neighbour of a neighbour is also likely to be a neighbour’. The All-Pairs algorithm [5] tackles the problem of computing the pairwise similarity matrix, often used as the input graph structure, in an efficient and exact manner, showing speed improvements over another inverted-list-based approach, ProbeOpt-sort [26], and over well-known signature-based methods such as Locality Sensitive Hashing (LSH) [15]. In this paper we take a broader view, considering both the task of creating a graph from text documents and that of propagating labels over it for text categorisation. We are interested in an approach that can be applied to textual streams, with the previously mentioned additional benefits offered by moving away from classical content-based classifiers. This paper is an extended version of [21].

2 Graph Construction

Graph construction \(\mathcal{X} \rightarrow \mathcal{G}\) deals with taking a corpus \(\mathcal{X} = \left \{{x}_{1},\ldots,{x}_{n}\right \}\), and creating a graph \(\mathcal{G} = (V,E,W)\), where V is the set of vertices, with document \(x_i\) represented by the vertex \(v_i\), E is the set of edges, and W is the edge weight matrix. There are several ways the construction can be adapted, namely the choice of distance metric and the method for maintaining sparsity.

The distance metric is used to determine the edge weight matrix W. The weight of an edge \(w_{ij}\) encodes the similarity between the two vertices \(v_i\) and \(v_j\). The choice of metric used is mostly task dependent, relying on an appropriate selection being made based upon the type of data in \(\mathcal{X}\). A common measure used for text, such as cosine similarity [24], may not be appropriate for other data types, such as when dealing with histogram data where the \(\chi^2\) distance is more meaningful [30].

Typically, a method for maintaining sparsity is required since it is not desirable to work with fully connected graphs, both for reasons of efficiency and because of susceptibility to noise in the data [19]. This can be solved by working with sparse graphs, which are easier to process. Two popular methods for achieving sparsity are k-nearest neighbour (kNN) and ε-neighbourhood, both utilising the local neighbourhood properties of each vertex in the graph [7, 19, 23]. Local neighbourhood methods are important for efficiency since a data point only relies on information about other points close by, with respect to the distance metric, to determine the neighbours of a vertex. This means that no global properties need to be recalculated over the entire graph each time a new vertex is added, a consideration that has implications both for scalability and, more generally, for parallelisation.

The first step of creating the graph usually involves calculating the pairwise similarity score between all pairs of vertices in the graph using the appropriately chosen distance metric. Many studies assume that it is feasible to create a full \(N \times N\) distance matrix [19] or that a graph is already given [8, 18]. This assumption can severely limit the size of data that is manageable, limited by the \(\mathcal{O}({N}^{2})\) time complexity of the pairwise calculation. Construction of a full graph Laplacian kernel, as required by standard graph labelling methods [6, 17, 32], is already computationally challenging for graphs with 10,000 vertices [18]. Jebara et al. [19] introduce β-matching, an interesting method of graph sparsification where each vertex has a fixed degree β, and show an improved performance over k-nearest neighbour, but at a cost to the complexity of the solution and under the assumption that a fully connected graph is given.

We can overcome the issue of \(\mathcal{O}({N}^{2})\) time complexity for computing the similarity matrix by using an alternative method, converting the corpus into an inverted index where each term has a pointer to the documents the term appears within. The advantage of this approach is that the corpus is mapped into a space based upon the number of terms, rather than the number of documents. This relies on the size of the vocabulary \(\vert t\vert\) being much smaller than the size of the corpus. According to Heaps’ Law, the number of terms \(\vert t\vert\) appearing in a corpus grows as \(\mathcal{O}({N}^{\beta })\), where β is a constant between 0 and 1 dependent on the text [16]. Some experiments on English text have shown that in practice β is between 0.4 and 0.6 [3, 4]. The inverted index can be built in \(\mathcal{O}(N{L}_{d})\) time, where \(L_d\) is the average number of terms in a document, with a space complexity of \(\mathcal{O}(N{L}_{v})\), where \(L_v\) is the average number of unique terms per document [29].
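As an illustration, the following minimal Python sketch builds such an inverted index from tokenised documents; the function name and data layout are our own, not the NOAM implementation, but the behaviour matches the \(\mathcal{O}(N{L}_{d})\) time and \(\mathcal{O}(N{L}_{v})\) space bounds quoted above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of documents it appears within.

    `docs` is a list of token lists; building the index touches every
    token once, i.e. O(N * L_d) time and O(N * L_v) space.
    """
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            index[term].add(doc_id)
    return index

# toy usage
docs = [["markets", "rise"], ["markets", "fall"], ["rain", "forecast"]]
index = build_inverted_index(docs)
# index["markets"] == {0, 1}
```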

Finding the neighbours of a document is also trivial because of the inverted index structure. A classical approach is to use the Term Frequency-Inverse Document Frequency (TF-IDF) weighting [24] to calculate the cosine similarity between two documents. This can be performed in \(\mathcal{O}({L}_{d}\log \vert t\vert )\) time for each document by performing \(L_d\) binary searches over the inverted index. Assuming β from Heaps’ Law is the average value of 0.5, the time complexity for finding the neighbours of a document can be rewritten as \(\mathcal{O}(\frac{{L}_{d}} {2} \log N)\). Therefore, there is a total time complexity of \(\mathcal{O}(N + \frac{N{L}_{d}} {2} \log N)\) for building the index and finding the neighbours of all vertices in the graph. This is equivalent to \(\mathcal{O}(N\log N)\) under the assumption that the average document length \(L_d\) is constant.
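A hedged sketch of this neighbour search is given below: TF-IDF weights are computed once, and the inverted index from the previous sketch is used so that a document is only scored against documents sharing at least one term with it. All helper names are ours, and a production system would rely on a search-engine index rather than plain Python dictionaries.

```python
import math
from collections import defaultdict

def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            df[term] += 1
    vectors = []
    for tokens in docs:
        tf = defaultdict(int)
        for term in tokens:
            tf[term] += 1
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def neighbours(query_id, vectors, index):
    """Cosine similarity of one document against only those documents
    retrieved through the inverted index (shared terms), ranked by
    decreasing similarity."""
    scores = defaultdict(float)
    for term, weight in vectors[query_id].items():
        for doc_id in index.get(term, ()):
            if doc_id != query_id:
                scores[doc_id] += weight * vectors[doc_id].get(term, 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```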

A further advantage of this method is that the number of edges per vertex is limited a priori, since it is infeasible to return the similarity with all documents in the inverted index for every document. This allows the construction of graphs that are already sparse, rather than performing graph sparsification to obtain a sparse graph from the fully connected graph, e.g. [19].

We investigate two popular local neighbourhood methods, k-nearest neighbour (kNN) and ε-neighbourhood, for keeping the graph sparse during the initial construction phase and also when new vertices are added to the graph [7, 19, 23]. Figure 1 shows intuitively how each of the methods chooses the edges to add for a given vertex. The first method, kNN, connects each vertex to the k most similar vertices in V, excluding itself. That is, for two vertices \(v_i\) and \(v_j\), an edge is added if and only if the similarity between \(v_i\) and \(v_j\) is within the largest k results for vertex \(v_i\). The second method we investigate, ε-neighbourhood, connects all vertices within a distance ε of each other, a similar approach to classical Parzen windows in machine learning [25]. This places a lower bound on the similarity between any two neighbouring vertices, i.e. only edges with a weight above the threshold ε are added to the graph. A simple way of visualising this is by drawing a sphere around each vertex with radius ε, where any vertex falling within the sphere is a neighbour of that vertex. While the first method fixes the degree distribution of the network, the second does not, resulting in fundamentally different topologies. We will investigate the effect of these topologies on labelling accuracy.
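The two sparsification rules can be sketched as follows, assuming the ranked neighbour list produced by the search sketch above; the function names and the default values k = 5 and ε = 0.4 (the values selected in Sect. 4) are purely illustrative.

```python
def knn_edges(vertex, ranked_neighbours, k=5):
    """kNN rule: keep edges to the k most similar vertices."""
    return [(vertex, other, sim) for other, sim in ranked_neighbours[:k]]

def epsilon_edges(vertex, ranked_neighbours, eps=0.4):
    """epsilon-neighbourhood rule: keep every edge whose similarity
    exceeds the threshold eps."""
    return [(vertex, other, sim) for other, sim in ranked_neighbours if sim > eps]
```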

Fig. 1

Illustration of an example where two graphs, \({\mathcal{G}}_{1}\) and \({\mathcal{G}}_{2}\), are being constructed using the two methods we investigate: k-nearest neighbour and ε-neighbourhood. In the example, the possible edges for vertex \(v_1\) are being considered. The k-nearest neighbour method ranks the closeness of the adjacent vertices with respect to a given similarity measure, then adds edges to the closest k vertices. For this example, k = 2. The ε-neighbourhood method adds all edges which connect \(v_1\) to a vertex inside the large circle which visualises the radius ε

3 Label Propagation

Label propagation aims to use a graph \(\mathcal{G} = (V,E,W)\) to propagate labels from labelled vertices to unlabelled vertices. Each vertex \(v_i\) can have multiple labels, i.e. a document can have multiple annotations, and each label is considered independently of the other labels assigned to a vertex. The labels assigned to the set of labelled vertices \({\mathcal{Y}}_{l} = \left \{{y}_{1},\ldots,{y}_{l}\right \}\) are used to estimate the labels \({\mathcal{Y}}_{u} = \left \{{y}_{l+1},\ldots,{y}_{l+u}\right \}\) on the unlabelled set.

Carreira-Perpinan et al. [7] suggest constructing graphs from ensembles of minimum spanning trees (MST) as part of their label propagation algorithm, with their two methods, Perturbed MSTs (PMSTs) and Disjoint MSTs (DMSTs), having a complexity of approximately \(\mathcal{O}(T{N}^{2}\log N)\) and \(\mathcal{O}({N}^{2}(\log N + t))\) respectively, where N is the number of vertices, T is the number of MSTs ensembled in PMSTs, and t is the number of MSTs used in DMSTs, typically \(t \ll \frac{N}{2}\). However, to the best of the authors’ knowledge, no studies have performed experiments on constructed graphs with more than several thousand vertices, with the exception of Herbster et al. [18], who build a shortest path tree (SPT) and an MST for a graph with 400,000 vertices from web pages. Herbster et al. [18] also note that constructing their MST and SPT using the Prim and Dijkstra algorithms [11], respectively, takes \(\mathcal{O}(N\log N + \vert E\vert )\) time, with the general case of a non-sparse graph having a time complexity of \(\Theta({N}^{2})\).

In this paper we adopt Online Majority Voting (OMV) [8], a natural adaptation of the Label Propagation (LP) algorithm [32], as our algorithm for the label propagation step, due to its efficiency, simplicity, and experimental performance [1]. OMV is based closely upon the locality assumption that vertices that are close to one another, with respect to a distance or similarity measure, should have similar labels. Each vertex is sequentially labelled by the unweighted majority vote over all labels of its neighbouring vertices. The time complexity of OMV is \(\Theta (\vert E\vert )\), a notable reduction from the \(\mathcal{O}(k{N}^{2})\) required for the LP algorithm, where k is the number of neighbours per vertex. The dependence of the complexity on the number of edges in the graph further benefits from the a priori limit we impose upon the maximum number of edges per vertex, ensuring that \(\vert E\vert = bN\) for some maximum edge limit b.
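A minimal sketch of the OMV step for a single incoming vertex is shown below; it treats each label independently in the multi-label setting and keeps a label when it is carried by a strict majority of the neighbours. The function name and toy labels are ours.

```python
from collections import Counter

def omv_label(neighbour_labels):
    """Online Majority Voting for one new vertex.

    `neighbour_labels` is a list of label sets, one per neighbouring
    vertex; a label is assigned when more than half of the neighbours
    carry it, each label being considered independently.
    """
    counts = Counter(label for labels in neighbour_labels for label in labels)
    threshold = len(neighbour_labels) / 2
    return {label for label, votes in counts.items() if votes > threshold}

# toy usage with k = 5 neighbours
print(omv_label([{"markets"}, {"markets", "politics"}, {"markets"},
                 {"weather"}, {"markets"}]))          # -> {'markets'}
```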

4 Experiments and Evaluation

We present an experimental study of the feasibility of our approach on a large dataset, the Reuters RCV1 corpus [22]. We split the corpus into a training and test set, where the test set is the last 7 weeks of the corpus, and the training set covers everything else. The test set is further subdivided into 7 test weeks. The performance was evaluated using the \(F_1\) Score, which is the harmonic mean of the precision and recall, a widely used performance metric for classification tasks [24]. We evaluate the performance of each method on the 7 test weeks, where all previous weeks have already been added to the graph, to simulate an online learning environment. The \(F_1\) Scores reported are the mean performance, over the 50 most common topics, averaged over the 7 test weeks.
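For concreteness, a small sketch of this evaluation protocol is given below, assuming scikit-learn for the \(F_1\) computation; the per-week, per-topic array layout is an assumption of ours, not a description of the original evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score

def mean_f1(y_true, y_pred):
    """Mean F1 over the most common topics, averaged over the test weeks.

    y_true[week][topic] and y_pred[week][topic] are binary vectors over
    the documents of that week.
    """
    weekly_means = []
    for true_week, pred_week in zip(y_true, y_pred):
        topic_f1 = [f1_score(t, p) for t, p in zip(true_week, pred_week)]
        weekly_means.append(np.mean(topic_f1))
    return float(np.mean(weekly_means))
```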

4.1 Graph-Based Content Classification

For the graph-based methods, the hyperparameters k and ε require careful selection in order to achieve performance comparable with current methods. This is the most expensive step as it often requires a search of the parameter space for the best value. We use Leave-One-Out Cross Validation (LOO-CV) on the training set to tune the parameters. This involves constructing graphs for a range of values of ε and k on the training set by iterating over all vertices, predicting the labels of each vertex from the majority vote of its neighbours. The predictions are checked against the true labels, and the highest-performing parameter value is chosen. The performance for values of ε and k can be seen in Fig. 2(a) and 2(b). The best parameters for each topic individually were also recorded, allowing for a multi-parameter graph where each topic label uses a different parameter value. This could informally be thought of as each label being able to travel a certain distance along each edge. Figures 3(a) and 3(b) show the performance difference on each topic between using the general parameter values ε = 0.4 and k = 5 for every topic, and using the optimal value found for each topic individually. It can be seen that for some topics a small increase in performance can be achieved, but the performance gain is minimal (with some loss for ε-Neighbourhood) at the expense of constructing multiple graphs, and so this approach is not considered further. Figure 4 shows a direct comparison of the graph-based methods with each other. Out of the 50 most common topics, kNN has a higher performance on 46. Clearly, ε-Neighbourhood is the weaker of the graph-based methods.
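A hedged sketch of this LOO-CV sweep, reusing the neighbour search and OMV helpers from the earlier sketches, is shown below; the exact-match scoring used here is a simplification (the experiments score per-topic \(F_1\)), and all names are ours.

```python
def loo_cv_score(vectors, index, labels, param, method="knn"):
    """Leave-one-out estimate for one candidate value of k or epsilon:
    predict each vertex from its neighbours and compare with its true
    label set (exact match, for brevity)."""
    hits = 0
    for v in range(len(vectors)):
        ranked = neighbours(v, vectors, index)       # from the earlier sketch
        if method == "knn":
            chosen = ranked[:param]
        else:                                        # epsilon-neighbourhood
            chosen = [(u, s) for u, s in ranked if s > param]
        predicted = omv_label([labels[u] for u, _ in chosen])
        hits += predicted == labels[v]
    return hits / len(vectors)

# sweep e.g. k in 1..100, or eps in 0.00..1.00, keeping the best scorer
```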

Fig. 2

\(F_1\) performance for the 50 most common topics evaluated on the training set using LOO-CV for (a) \(\epsilon= \left \{0,0.01,\ldots,1.00\right \}\) and (b) \(k = \left \{1,2,\ldots,100\right \}\). It can be seen that there is a peak at ε = 0.4 and k = 5

Fig. 3

Comparison of the mean \(F_1\) Score, averaged over all test set weeks, for the graph-based methods with a single best parameter against a multi-parameter approach on the 50 most common topics. Points below the diagonal line indicate when the single parameter method achieved a higher performance, with points above the diagonal line indicating that the multi-parameter method achieved a higher performance on that topic

Fig. 4

Comparison of the mean \(F_1\) Score, averaged over all test set weeks, for the graph-based methods on the 50 most common topics. Points below the diagonal line indicate when ε-Neighbourhood achieved a higher performance than kNN, with points above the diagonal line indicating that kNN achieved a higher performance than ε-Neighbourhood on that topic

4.2 Comparison with Content-Based SVMs Performance

A comparison of the graph-based methods with the current state of the art in content-based classification was performed. The SVMs were deployed using the LibSVM toolbox [10]. We trained one SVM per topic using the Cosine kernel, which is a normalised version of the Linear kernel [28]. For each topic, training used 10,000 randomly selected positive examples and 10,000 randomly selected negative examples picked from the training set. The examples were first stemmed and stop words were removed, as for the graph-based methods. The last week of the training corpus was used as a validation set to empirically tune the regularisation parameter C from the set {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100}. For each topic, C was tuned by setting it to the value achieving the highest \(F_1\) performance on that topic in the validation set. Figure 5(a) and 5(b) show a comparison of the graph-based methods with SVMs. Out of the 50 most common topics, SVM achieved a higher performance than ε-Neighbourhood on 29 topics, but only beat kNN on 19 of the topics; that is, kNN performed better than SVMs on 31 out of the 50 topics. This shows that the graph-based methods are competitive with the performance of SVMs.
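The experiments used LibSVM directly; the scikit-learn sketch below (SVC wraps LibSVM) shows an equivalent setup under the assumption that the Cosine kernel is realised as a Linear kernel on L2-normalised TF-IDF vectors, with C selected by validation \(F_1\). The function name and data layout are ours.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import normalize
from sklearn.metrics import f1_score

def train_topic_svm(X_train, y_train, X_val, y_val,
                    Cs=(0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100)):
    """One binary SVM per topic: a Linear kernel on L2-normalised
    TF-IDF vectors is equivalent to the Cosine kernel; C is chosen by
    F1 on the validation week."""
    X_train, X_val = normalize(X_train), normalize(X_val)
    best_model, best_C, best_f1 = None, None, -1.0
    for C in Cs:
        model = SVC(kernel="linear", C=C).fit(X_train, y_train)
        score = f1_score(y_val, model.predict(X_val))
        if score > best_f1:
            best_model, best_C, best_f1 = model, C, score
    return best_model, best_C
```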

Fig. 5

Comparison of the mean \(F_1\) Score, averaged over all test set weeks, for (a) ε-Neighbourhood and (b) k-NN against SVMs on the 50 most common topics. Points below the diagonal line indicate when SVMs achieved a higher performance than the graph-based method, with points above the diagonal line indicating that the graph-based method achieved a higher performance than SVMs on that topic

4.3 Building Ensembles of Graph-Based and Content-Based Approaches

Further to the comparison of the graph-based methods with SVMs, an ensemble [12] of the graph-based and content-based classification methods was evaluated. For each vertex, a majority vote for each class label c is taken over the k votes of the kNN method, supplemented with s votes from the SVMs, for a total of \(\upsilon= k + s\) votes. That is, each vertex has the k votes from the kNN method, but also s votes assigned by the SVMs. The number of votes from the SVM is chosen as \(s \in \{0,1,\ldots,k + 1\}\). This moves the combination method from purely graph-based at s = 0 (υ = k), to purely content-based at \(s = k + 1\) (\(\upsilon= 2k + 1\)).

Given a set of p class labels \(C =\{ {c}_{1},{c}_{2},\ldots,{c}_{p}\}\), a set of n vertices \(V =\{ {v}_{1},{v}_{2},\ldots,{v}_{n}\}\), an adjacency matrix \(A \in \{0,1\}^{n \times n}\) where \(A_{i,j}\) indicates whether \(v_j\) is a neighbour of \(v_i\), a label matrix \(Y \in \{0,1\}^{n \times p}\) where \(Y_{j,c}\) indicates whether vertex \(v_j\) has class label c, an SVM-assigned label matrix \(S \in \{0,1\}^{n \times p}\) where \(S_{i,c}\) indicates whether class label c has been assigned to vertex \(v_i\) by the SVMs, and a regularisation parameter \(\lambda \in [0,1]\), let \(\widetilde{{Y }}_{i,c}\) be the decision of whether label c is to be assigned to vertex \(v_i\). Formally, a linear combination of the methods was created as

$$\widetilde{{Y }}_{i,c} = \theta \left (\lambda \sum\limits_{j}({A}_{i,j}{Y }_{j,c}) + (1 - \lambda ){S}_{i,c}\right )$$
(1)
$$\theta (x) = \left \{\begin{array}{l l} 1&\quad \mathrm{if}\ x > \frac{\upsilon } {2} \\ 0&\quad \text{ otherwise}\\ \end{array} \right.$$
(2)

Equation 1 can be reformulated so that it is easier to interpret by setting \(\mu= \frac{1-\lambda } {\lambda }\), giving

$$\widehat{{Y }}_{i,c} = \theta \left (\sum\limits_{j}({A}_{i,j}{Y }_{j,c}) + \mu {S}_{i,c}\right )$$
(3)

where μ represents the number of SVM votes s in the interval [0, k + 1].
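A minimal sketch of this combination rule for a single vertex and label is given below, reading Eq. (3) as a majority vote over the \(\upsilon = k + \mu\) available votes; the function name and toy values are ours.

```python
def combined_label(neighbour_has_label, svm_assigns_label, mu, k=5):
    """Eq. (3) for one vertex and one class label: neighbour votes plus
    mu votes from the SVM, accepted when the total exceeds half of the
    upsilon = k + mu possible votes."""
    votes = sum(neighbour_has_label) + mu * int(svm_assigns_label)
    return int(votes > (k + mu) / 2)

# toy usage: k = 5 neighbours, mu = 4 SVM votes
print(combined_label([1, 1, 0, 0, 0], True, mu=4))    # 2 + 4 = 6 > 4.5 -> 1
```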

For our experiments, the value of μ for combining the kNN and SVM methods was evaluated between 0 and 6 since the kNN method uses k = 5 neighbours.

Next, we consider the best value of μ for combining the kNN method and SVMs in a linear combination. Figure 6 shows the performance of the combined method averaged over the 50 most common topics for each value of μ. Out of the 50 most common topics, the combined method with μ = 4 provided an improvement over the performance of both the SVM and kNN methods for 36 of the topics. Using μ = 1 showed an improvement over both methods for the greatest number of topics, with 38 of the 50 topics seeing an improvement. The mean performance of the combined method with μ = 1 is, however, lower than for μ = 4, indicating that when μ = 4 the improvements are greater on average, even if there are slightly fewer of them.

Fig. 6

Comparison of the mean \(F_1\) Score, averaged over all test set weeks, for the combined method at different μ values on the 50 most common topics. It can be seen that the combined method offers an improvement over the kNN approach (μ = 0) and the SVM approach (μ = 6)

When comparing the combined method with SVM and kNN, as seen in Fig. 7(a) and 7(b) respectively, the performance of the combined method was higher than SVM on 45 of the 50 topics and higher than kNN on 41 out of the 50 topics. This shows that the combined method not only improves on SVM and kNN on average, but also provides an improvement for 90% and 82% of the 50 most common topics, respectively. It should be noted that in the cases where the combined method does not provide an improvement over one of the methods, it still has a higher performance than the lowest-performing method for that topic. That is, there were no cases where combining the methods gives a performance below both of the methods individually.

Fig. 7

Comparison of the mean \(F_1\) Score, averaged over all test set weeks, for the combined method using μ = 4 against (a) SVMs and (b) kNN on the 50 most common topics. Points below the diagonal line indicate when the combined method achieved a higher performance, with points above the diagonal line indicating that the individual method achieved a higher performance than the combined method on that topic

A summary of the overall performance of each method can be seen in Fig. 8. The ε-Neighbourhood method is the weaker of the two methods proposed with a performance of 62.2%, while the kNN method achieved a performance of 65.9%, beating the 64.5% for SVMs. Combining the kNN and SVM methods reached the highest performance at 71.4% with μ = 4, showing that combining the relation-based and content-based approaches is an effective way to improve performance.

Fig. 8

Summary of the mean \(F_1\) Score, averaged over all test set weeks, for the graph-based methods and SVMs along with the best combined method (μ = 4) on the 50 most common topics. It can be seen that the graph-based methods are comparable with SVMs, with the combined method showing a further improvement. It should be noted that the performance of the combined method is slightly biased due to selecting for the best μ. ε-Neighbourhood has been abbreviated to εN

5 Conclusion

We have investigated a scalable method for annotating a large and growing corpus. This is achieved by efficiently creating a sparse graph and propagating the labels along its edges. In our case study the edges were created using bag-of-words similarity, but potentially we could use any other measure that is correlated with the labels being propagated. There has been increased theoretical interest in methods of label propagation, and some interest in how graph construction interplays with the propagation algorithms. Findings suggest that the method of graph construction cannot be studied independently of the subsequent algorithms applied to the graph [23].

We claim that label propagation has many advantages over traditional content-based approaches such as SVMs. New labels introduced into the system can be adopted with relative ease, and will automatically begin to be propagated through the graph. In contrast, a new SVM classifier would need to be trained from scratch to classify documents with the new class label. A second advantage of label propagation is that incorrectly annotated documents can be reclassified based upon new documents in a self-regulating way. That is, the graph continuously learns from new data and improves its quality of annotation, while the SVM is fixed in its classification after the initial training period.

In this paper, we have investigated two different local neighbourhood methods, ε-Neighbourhood and k-Nearest Neighbour, for constructing graphs for text. We have shown that sparse graphs can be constructed from large text corpora in \(\mathcal{O}(N\log N)\) time, with the cost of propagating labels on the graph linear in the size of the graph, i.e. \(\mathcal{O}(N)\). Our results show that the graph-based methods are competitive with content-based SVM methods. We have further shown that combining the graph-based and content-based methods leads to an improvement in performance. The proposed methods can easily be scaled out to a distributed setting using currently available open source software such as Apache Solr or Katta, allowing a user to handle millions of texts with similarly effective performance. Research into novel ways of combining the relation-based and content-based methods could lead to further improvements in categorisation performance while keeping the cost of building the graph and propagating labels on it to a minimum.