1 Introduction

The widespread diffusion of new communication technologies such as the Internet, together with the development of intelligent artificial systems capable of producing and sharing different kinds of data, has led to a dramatic increase in the amount of available information. One of the main goals in this context is to transform heterogeneous and unstructured data into useful and meaningful information through Big Data techniques, deep neural networks and the myriad of applications that derive from their implementation. For this purpose, document categorization and classification is an essential task in the information retrieval domain, strongly affecting user perception [16]. The goal of classification is to associate one or more classes with a document, easing the management of a document collection. Classification techniques have been widely applied in different contexts, paying particular attention to the semantic relationships between terms and the concepts they represent [27, 28]. The use of semantics in the document categorization task has allowed a more accurate detection of topics with respect to classical approaches based on raw text and meaningless labels [24].

Techniques relying on semantic analysis are often based on the idea of a semantic network (SN) [32]. Woods [37] highlighted the lack of a rigorous definition for semantic networks and their conceptual role. In the frame of this work, we refer to a semantic network as a graph which contains information about the semantic and/or linguistic relationships holding between several concepts. Lately, semantic networks have often been associated with ontologies, which are now a keystone in the field of knowledge representation, integration and acquisition [26, 29,30,31]. Moreover, ontologies are designed to be machine-readable and machine-processable. Over the years, the scientific community has provided many definitions of ontologies; one of the most widely accepted is given in [11]. Ontologies can be represented as graphs and vice versa, and this duality makes the two representations interchangeable. The use of graphs and graph analysis metrics permits fast retrieval of information and the discovery of new patterns of knowledge that would otherwise be hard to recognize.

Topic detection and categorization are crucial tasks which, when automated, allow quick access to the contents of a document collection. A disadvantage of many classification methods is that they treat the categorization structure without considering the relationships between categories. A better approach is to consider that such structures, either hierarchical or taxonomic, constitute the most natural way in which concepts, subjects or categories are organized in practice [1].

The novelty of the proposed work lies in the way we combine statistical information and natural language processing. In particular, the approach uses an algorithm for word sense disambiguation based on semantic analysis, ontologies and semantic similarity metrics. The core is a knowledge graph representing our semantic network (i.e., an ontology), which is used as the primary source for extracting further information. It is implemented by means of a NoSQL technology to perform a “semantic topic detection”.

The paper is organized as follows: in Sect. 2 we provide a review of the literature related to Topic Modeling and Topic Detection techniques and technologies; Sect. 3 introduces the approach along with the general architecture of the system and the proposed textual classification methodology; in Sect. 4 we present and discuss the experimental strategy and results; lastly, Sect. 5 is devoted to conclusions and future research.

2 Related Works

This section analyzes relevant recent works related to textual topic detection, as well as the differences between our approach and the ones described. Over the years, the scientific community has proposed several methodologies, here grouped according to the main technique used. The goal of approaches based on statistics is to identify the relevance of a term based on statistical properties such as TF-IDF [33], N-grams [8], etc. Topic modeling [20], instead, is an innovative and widespread analytical method for the extraction of co-occurring lexical clusters in a document collection. In particular, it makes use of an ensemble of unsupervised text mining techniques whose approaches are based on probabilities. The authors in [21] describe a probabilistic approach for Web page classification: they propose a dynamic and hierarchical classification system that is capable of adding new categories, organizing the Web pages into a tree structure and classifying them by searching through only one path of the tree. Other approaches use features based on linguistic, syntactic, semantic and lexical properties, hence they are named linguistic approaches. Similarity functions are employed to extract representative keywords, and machine learning techniques such as Support Vector Machines [35], Naive Bayes [39] and others are used; keyword extraction is the result of a trained model able to predict significant keywords. Further approaches combine the above-cited ones in several ways, also exploiting parameters such as word position, layout features, HTML tags, etc. In [13], the authors use machine learning techniques in combination with semantic information, while in [18] co-occurrence is employed for the derivation of keywords from a single document. In [12], the authors use linguistic features to represent term relevance, considering the position of a term in the document, and other works [25] build models of semantic graphs for representing documents. In [36], the authors present an iterative approach for keyword extraction considering relations at different document levels (words, sentences, topics): a graph containing the relationships between the different nodes is created, then the score of each keyword is computed through an iterative algorithm. In [2], the authors analyze probabilistic models for topic extraction. Xu et al. [38] center their research on topic detection and tracking, focusing on online news texts: they propose a method for modeling the evolution of news topics over time in order to track topics in a news text set, where topics are first extracted with the LDA (latent Dirichlet allocation) model and the Gibbs sampling method is used to estimate the parameters. In [34], an extended LDA topic model based on the occurrence of topic dependencies is used for spam detection in short text segments of online web forum discussions. Khalid et al. [14] use a parallel Dirichlet allocation model and the elbow method for topic detection from a conversational dialogue corpus. Bodrunova et al. [5] propose an approach based on sentence embeddings and agglomerative clustering by Ward’s method, where the Markov stopping moment is used for optimal clustering. Prabowo et al. [22] describe a strategy to enhance a system called ACE (Automatic Classification Engine) using ontologies.
The authors focus on the use of ontologies for classifying Web pages according to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes, using weighted terms in the Web pages and the structure of domain ontologies. The association of significant conceptual instances with their class representative(s) is performed using an ontology classification scheme mapping and a feed-forward network model. The use of ontologies is also explored in [17], where the authors propose a method for topic detection and tracking based on an event ontology that provides a hierarchy of event classes based on domain common sense.

In this paper, we propose a semantic approach for document classification. The main differences between our approach and the others presented so far are the proposal of a novel topic detection algorithm, based on semantic information extracted from a general knowledge base representing the user's domains of interest, and the full automation of our process, which requires no learning step.

3 The Proposed Approach

In this section, we provide a detailed description of our approach for topic detection. The main feature of our methodology is its ability to combine statistical information, natural language processing and several technologies to categorize documents using a comprehensive semantic analysis, which involves ontologies and metrics based on semantic similarity. To implement our approach, we follow a modular framework for document analysis and categorization. The framework makes use of a general knowledge base, where textual representations of semantic concepts are stored.

3.1 The Knowledge Base

We realized a general knowledge base using an ontology model proposed and implemented in [6, 7]. The database is realized by means of a NoSQL graph technology. From an abstract, conceptual point of view, the model representation is based on signs, defined in [9] as “something that stands for something, for someone in some way”. These signs are used to represent concepts. The model structure is composed of a triple \({<}S,P,C{>}\), where S is the set of signs, P is the set of properties used to link signs with concepts, and C is the set of constraints defined on the set P. We propose an approach focused on the use of textual representations and based on the semantic dictionary WordNet [19]. According to the terminology used in the ontology model, the textual representations are our signs. The ontology is defined using the DL version of the Web Ontology Language (OWL), a markup language that offers a high level of expressiveness while preserving completeness and computational decidability. The model can be seen as a top-level ontology, since it contains very abstract definitions for its classes. The model and the related knowledge graph have been implemented in the Neo4j graph database using the property graph model [3].
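To illustrate how the \({<}S,P,C{>}\) triple can be mapped onto the property graph model, the following Python sketch uses the official Neo4j driver to link a textual sign to a concept. The node labels (Sign, Concept), the HAS_MEANING relationship type and the weight property are illustrative assumptions, not the exact schema of our knowledge base.

```python
# Minimal sketch: mapping the <S, P, C> model onto the Neo4j property graph via
# the official Python driver. Node labels (Sign, Concept), the HAS_MEANING
# relationship type and the weight property are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_sign_to_concept(sign: str, concept: str, weight: float) -> None:
    # MERGE keeps signs and concepts unique; the weight on the relationship
    # models the expressive power of the property (P) linking a sign (S) to a
    # concept, subject to the constraints (C) of the model.
    with driver.session() as session:
        session.run(
            "MERGE (s:Sign {text: $sign}) "
            "MERGE (c:Concept {name: $concept}) "
            "MERGE (s)-[r:HAS_MEANING]->(c) "
            "SET r.weight = $weight",
            sign=sign, concept=concept, weight=weight,
        )

link_sign_to_concept("dog", "domestic_dog.n.01", 0.9)
driver.close()
```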

Figure 1 shows an excerpt of our knowledge graph, reported to give an idea of the complexity of the implemented graph. The excerpt is composed of about 15,000 nodes and 30,000 relations extracted from our knowledge base.

Fig. 1. Knowledge graph excerpt with 30,000 edges and about 15,000 nodes

3.2 The Topic Detection Strategy

Our novel strategy for textual topic detection is based on an algorithm called SEMREL. Its representation model is the classical bag-of-words. Once a document has been cleaned, i.e., unnecessary parts have been removed, the tokenization step produces the list of terms in the document. This list is the input of a Word Sense Disambiguation (WSD) step, which assigns the right meaning to each term. Then, in the SN Extractor step, Semantic Networks are dynamically extracted from our knowledge base for all the terms. The intersection between the SN of each concept and the SNs of the other concepts is then computed: the number of common nodes corresponds to the degree to which the considered concept represents the entire document. This measure is called Sense Coverage. Since this factor alone would favor the more generic concepts, a scaling factor depending on the depth of the considered concept is used; the depth is computed as the number of hops to the root of our knowledge base considering only hypernymy relationships. The TopicConcept is the concept with the best trade-off between SenseCoverage and Depth. The formula used for calculating the topic concept of a given document is shown in Eq. 1.

$$\begin{aligned} TopicConcept = \mathop {arg\,max}\limits _{C_i}\big (depth(C_i) \cdot Coverage(C_i)\big ) \end{aligned}$$
(1)

where \(C_i\) is the i-th concept resulting from the WSD step. Only concepts in the noun lexical category are considered from the WSD list because, in the authors’ opinion, they are the most representative of the topic of a document.

Algorithm 1 shows the logic used to find the topic concept.

Algorithm 1
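Since Algorithm 1 is only referenced above, the following Python sketch gives an illustrative rendering of its logic under our assumptions; the helper functions extract_semantic_network (the SN Extractor) and hypernym_depth (the number of hypernymy hops to the root) are hypothetical stand-ins for the corresponding modules.

```python
# Illustrative sketch of the topic-detection logic (Eq. 1); the helpers
# extract_semantic_network() and hypernym_depth() are hypothetical stand-ins
# for the SN Extractor and for the hop count to the root of the knowledge base.
def topic_concept(concepts, extract_semantic_network, hypernym_depth):
    networks = {c: set(extract_semantic_network(c)) for c in concepts}
    best, best_score = None, float("-inf")
    for c in concepts:
        others = set().union(*(networks[o] for o in concepts if o != c))
        # Sense coverage: number of nodes of c's semantic network that also
        # appear in the networks of the other concepts of the document.
        coverage = len(networks[c] & others)
        score = hypernym_depth(c) * coverage   # Eq. 1: depth * coverage
        if score > best_score:
            best, best_score = c, score
    return best
```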

The WSD step attempts to mitigate the issue of term polysemy: it tries to identify the correct meaning of a term by comparing each of its senses with all the senses of the other terms. The similarity between terms is calculated through a linguistic-based approach, and a metric computes their semantic relatedness [23].

This metric is based on a combination of the best path between pairs of terms and the depth of their Lowest Common Subsumer, expressed as the number of hops to the root of our knowledge base using hypernymy relationships.

The best path is calculated as follows:

$$\begin{aligned} l(w_{1},w_{2}) = min_{j} \sum _{i=1}^{h_{j}(w_{1},w_{2})}\dfrac{1}{\sigma _{i}} \end{aligned}$$
(2)

where \(l\) is the best path length between the terms \(w_1\) and \(w_2\), \(h_j(w_1,w_2)\) corresponds to the number of hops of the j-th path, and \(\sigma _i\) corresponds to the weight of the i-th edge of that path. The weights \(\sigma _{i}\) are assigned to the properties of the ontological model described in Sect. 3.1 to discriminate the expressive power of relationships, and their values are set experimentally.
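As a hedged illustration, the best path of Eq. 2 can be computed as a weighted shortest path in which every edge contributes \(1/\sigma _i\) to the cost. The sketch below assumes the knowledge graph is exposed as a networkx graph whose edges carry a sigma attribute; this is an assumption about the data access layer, not our actual implementation.

```python
# Sketch of the best-path computation of Eq. (2): each hop contributes
# 1/sigma to the path cost, and Dijkstra returns the minimum-cost (best) path
# between the two terms. The 'sigma' edge attribute is assumed.
import networkx as nx

def best_path_length(graph: nx.Graph, w1: str, w2: str) -> float:
    return nx.dijkstra_path_length(
        graph, w1, w2,
        weight=lambda u, v, data: 1.0 / data["sigma"],
    )
```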

The depth factor is used to give more importance to specific concepts (low level and therefore high depth) than to generic ones (low depth). A non-linear function is used to scale down the contribution of concepts in the upper levels and to increase that of the lower-level ones. The metric is normalized in the range [0, 1]: it is 1 when the length of the path is 0 and tends to 0 when the length goes to infinity.

The Semantic Relatedness Grade of a document is then calculated as:

$$\begin{aligned} SRG(\upsilon ) = \sum _{(w_i,w_j)} e^{-\alpha \cdotp l(w_i,w_j)} \dfrac{e^{\beta \cdotp d(w_i,w_j)}-e^{-\beta \cdotp d(w_i,w_j)}}{e^{\beta \cdotp d(w_i,w_j)}+e^{-\beta \cdotp d(w_i,w_j)}} \end{aligned}$$
(3)

where \((w_i,w_j)\) are pairs of terms in the document \(\upsilon \), \(l(w_i,w_j)\) is the best path length of Eq. 2, \(d(w_i,w_j)\) is the number of hops from the lowest common subsumer of \(w_i\) and \(w_j\) to the root of the WordNet hierarchy considering the IS-A relation, and \(\alpha \) and \(\beta \) are parameters whose values are set experimentally.
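Note that the second factor of Eq. 3 is the hyperbolic tangent of \(\beta \cdot d(w_i,w_j)\), which allows a compact implementation. The following Python sketch computes the SRG under that observation; the helpers path_length and subsumer_depth, as well as the default values of \(\alpha \) and \(\beta \), are placeholders rather than our tuned settings.

```python
# Sketch of the Semantic Relatedness Grade of Eq. (3): for each pair of terms,
# an exponential decay on the best path length l is combined with a tanh-shaped
# factor on the subsumer depth d. alpha and beta below are placeholder values.
import math
from itertools import combinations

def srg(terms, path_length, subsumer_depth, alpha=0.2, beta=0.45):
    total = 0.0
    for wi, wj in combinations(terms, 2):
        l = path_length(wi, wj)          # best path, Eq. (2)
        d = subsumer_depth(wi, wj)       # hops from the LCS to the WordNet root
        total += math.exp(-alpha * l) * math.tanh(beta * d)
    return total
```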

The WSD process calculates a score for each sense of the considered term using the proposed metric. The best sense associated with a term is the one which maximizes the SRG obtained from the semantic relatedness with all the other terms in the document.

The best sense recognition procedure is shown in Algorithm 2.

Algorithm 2

The best sense of a term is the one with the maximum score obtained by estimating the semantic relatedness with all the other terms of a given window of context.
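As Algorithm 2 is not reproduced here, the following Python sketch illustrates one possible rendering of the best-sense selection within a window of context; candidate_senses and relatedness are hypothetical helpers wrapping the knowledge base and the metric of Eq. 3, and the aggregation shown is an assumption about how per-sense scores are combined.

```python
# Illustrative sketch of best-sense selection: each candidate sense of the
# target term is scored against the senses of the other terms in the window of
# context, and the sense with the highest cumulative score wins.
# candidate_senses() and relatedness() are hypothetical helpers.
def best_sense(term, window, candidate_senses, relatedness):
    best, best_score = None, float("-inf")
    for sense in candidate_senses(term):
        score = sum(
            max((relatedness(sense, other_sense)
                 for other_sense in candidate_senses(other)), default=0.0)
            for other in window if other != term
        )
        if score > best_score:
            best, best_score = sense, score
    return best
```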

3.3 The Implemented System

The system architecture is shown in Fig. 2. It is composed of multiple modules, each responsible for managing specific tasks.

Fig. 2. The system architecture

The Web Documents are fetched from different data sources by means of the Fetcher module and stored in the Web document repository. The textual information is first pre-processed: cleaning operations are carried out by the Document Pre-Processor module. Such operations are: (i) tag removal, (ii) stop word deletion, (iii) elimination of special characters, (iv) stemming. The Topic Detection module identifies the correct topic of a document using an algorithm based on text analysis and our graph knowledge base; it relies on the WSD and TD tasks based on the algorithms previously discussed and is able to classify a document by recognizing its main topic. The Topic Detection result is the input of the Taxonomy Classificator, which, with the help of our knowledge base, creates a hierarchy starting from a concept. The proposed metric and approach have been compared with baselines, and the results are shown in the next section.
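For illustration, the cleaning operations of the Document Pre-Processor can be approximated with standard NLTK components, as in the sketch below; the actual module may rely on different tools.

```python
# Sketch of the cleaning operations (tag removal, stop-word and special-
# character elimination, stemming), approximated with NLTK components.
# Requires the NLTK 'stopwords' and 'punkt' data packages.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(html_text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html_text)      # (i) tag removal
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # (iii) special characters
    tokens = word_tokenize(text.lower())
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stops]  # (ii), (iv)
```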

4 Test Strategy and Experimental Results

In order to measure the performance of our framework, we have carried out several experiments, discussed in the following. First, we compare it with two reference algorithms widely used in the topic detection research field, in order to have a more robust and significant evaluation: LSA [15] and LDA [4]. Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is based on a vectorial representation of documents through the bag-of-words model. Latent Dirichlet Allocation (LDA) is a text-mining technique based on statistical models.
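For reference, baseline models of this kind can be built, for instance, with the gensim library; the sketch below is a minimal illustration, and the number of topics and the tokenization are assumptions rather than the exact settings of our experiments.

```python
# Minimal gensim sketch of LSA and LDA baselines over a tokenized corpus.
# The number of topics and training parameters are illustrative only.
from gensim import corpora
from gensim.models import LsiModel, LdaModel

def build_baselines(tokenized_docs, num_topics=14):
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lsa = LsiModel(corpus, id2word=dictionary, num_topics=num_topics)
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    return lsa, lda
```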

One remarkable feature of the system is that it is highly generalizable, thanks to the development of autonomous modules. In this paper, we have used the textual content of DMOZ [10], one of the most popular and rich multilingual web directories with open content. The archive is made up of links to web content organized according to a hierarchy. The reason why we chose DMOZ lies in the fact that we want to compare our results with baselines: this way we can test against a real experimental scenario using a public and well-known repository. The category at the top level is the root of the DMOZ hierarchy; since it is not informative at all, it has been discarded. A ground truth has then been built considering a subset of documents from the categories placed at the second level. These are shown in Table 1 together with statistics for the used test set.

Table 1. DMOZ - URLs per category

Fig. 3. Accuracy of textual topic detection

The list of URLs is submitted to our fetcher to download the textual content. The restriction to a subset of DMOZ was necessary due to the presence of numerous dead links and pages with little usable textual information. Out of a total of 12,120 documents, 10,910 were selected to create the topic models used by LSA and LDA, while 1,210 documents were used as the test set. The testing procedure employed in this paper uses our knowledge graph for the topic classification task. In order to have a fair and reliable comparison among all the implemented algorithms, the same technique must be used; hence, we performed a manual mapping of the used DMOZ categories to their respective WordNet synsets. In this way, we created a ground truth using a pre-classified document directory (i.e., DMOZ) through a mapping with a formal and well-known knowledge source (i.e., WordNet). The annotation process also facilitates the classification of documents by the other algorithms, e.g., LSA and LDA, because they produce several topics representing the main topics of the analyzed collection without any relation to the DMOZ categories. The central facet of our framework has been carefully evaluated to show the performance of the proposed methodology. For the textual topic detection, the LSA and LDA models have been implemented and generated, as well as the proposed SEMREL algorithm in two variants. The first one computes the SRG of a sense of a term by semantically comparing it with all the terms of the whole document. The second one divides a document into grammatical periods, delimited by the punctuation marks dot, question mark and exclamation mark (i.e., windows of context); the semantic relatedness of a concept is then calculated considering each sense of a term belonging to its window of context. Figure 3 shows the accuracy of the obtained results.
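The manual mapping between DMOZ categories and WordNet synsets can be illustrated with NLTK as in the sketch below; the specific synsets and the matching criterion shown are examples of how a category name can be pinned to a single WordNet sense, not our complete annotation.

```python
# Illustrative sketch of the DMOZ-category -> WordNet-synset mapping used to
# build the ground truth; the synsets below are examples, not the full mapping.
from nltk.corpus import wordnet as wn

CATEGORY_TO_SYNSET = {
    "Sports": wn.synset("sport.n.01"),
    "Science": wn.synset("science.n.01"),
    "Health": wn.synset("health.n.01"),
}

def is_correct(predicted_synset, category):
    # One possible matching criterion: exact match with the category's
    # annotated WordNet sense (the actual evaluation details may differ).
    return predicted_synset == CATEGORY_TO_SYNSET[category]
```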

We argue that these results depend on the impossibility of mapping some topics generated by LSA or LDA to the corresponding WordNet synset. This issue does not allow an accurate topic detection, due to the dependency of these models on the data set. On the other hand, SEMREL achieves a better concept recognition, removing the noise coming from specific datasets.

5 Conclusion and Future Works

In this paper, we have proposed a semantic approach based on a knowledge graph for the textual topic detection of web documents. For this purpose, a word sense disambiguation algorithm has been implemented, and semantic similarity metrics have been used. The system has been fully tested with a standard web document collection (i.e., DMOZ), and its design allows the use of different document collections. The evaluation of our approach shows promising results, also in comparison with state-of-the-art algorithms for textual topic detection. Our method has some limitations due to the lack of knowledge in several conceptual domains of our knowledge base (i.e., WordNet). In future work, we are interested in the definition of automatic techniques to extend our knowledge base with additional multimedia information and domain-specific ontologies. Moreover, we want to investigate novel methodologies to improve the performance of the topic detection process by exploiting multimedia data and considering new metrics to compute semantic similarity. Other aspects to address are the computational efficiency of our approach and additional testing with different document collections.