1 Introduction

A topic model regards a document as a probability distribution over topics and each topic as a probability distribution over words. It is used for similarity measurement [1], document classification [2] and clustering [3], sentiment analysis [4, 5] and other tasks, and it can discover latent topics and produce descriptions of a collection of documents [6]. Text documents can be divided by length into long, medium-length and short texts. Tweets, commodity reviews and the like are short texts, while the State Council's work reports, newsletters, etc. are long texts. Short texts are brief and semantically sparse, so extracting topics from them directly is problematic and remains an active research question. In order to express the author's views coherently, most medium and long texts share several common features: greater length, an explicit text structure, and different topical emphases in different sections. The traditional topic model applies the same treatment regardless of text length: a document is modeled as a vector over topics, and the semantic topic structure inside the document is ignored, which is unreasonable to some extent. For instance, for a long work report that covers four topics simultaneously, i.e., economics, education, health care and agriculture, the traditional topic model treats it as a single document and fails to recognize the semantic structure of the text. After topic modeling, it may output a questionable document-topic distribution in which one dimension's value is much larger than the others, i.e., the distribution may concentrate on only one or two of the topics. Clearly, such a topic representation cannot adequately capture the semantic features of the original text.

For this reason, we introduce the concept of the semantic topic unit and propose a method of LDA topic modeling based on partition (LDAP). A semantic topic unit can be understood as the smallest unit containing a single semantic topic. In a long document, semantic topic units can be delimited by blocks, sentences, n-grams, paragraphs or chapters, and different semantic topic units carry different topics. Modeling them separately yields a more reasonable distributional representation. By dividing a longer text into topic units, we can make full use of the semantic information hidden in the text structure and achieve a better modeling effect.

The rest of the paper is organized as follows: Sect. 2 gives a brief introduction to LDA, enumerates some improved topic models based on LDA and points out the relevant research on semantic units. Sect. 3 describes the main ideas and basic principles of the proposed method, LDAP, in detail. Experiments and results are presented in Sect. 4, and the final section gives conclusions and a discussion of future work.

2 Related Works

The development of topic models has gone through a long process. Strictly speaking, the earliest topic model is the latent Dirichlet allocation (LDA) model proposed by Blei et al. [7]. LDA is a probabilistic generative model that assumes a document is generated as follows: first, a topic is selected with a certain probability from the document-topic distribution; then a word is selected from that topic's topic-word distribution; and this process is repeated until a whole document is generated. LDA is the basis of topic model development, and most research on topic models builds on improvements to LDA. Such improvements include the study of document context information, syntactic structure, etc. Griffiths et al. [8] combined an HMM with LDA and proposed the HMM-LDA model. Wang and Mccallum [9] introduced word collocation into the topic model and proposed the Topical n-gram (TNG) model. Boyd-Graber and Blei [10] put forward the Syntactic Topic Model. Shen et al. [11] avoided the limitations of probabilistic topic models and gave a heterogeneous topic model (HTM). There is also considerable research on topic modeling that takes text length into account. Because short texts have sparse semantic features, modeling their topics directly is difficult, which makes them a current research hot spot. Quan et al. [12] proposed a method for calculating the similarity between two short text snippets. Mihalcea et al. [13] used both corpus-based and knowledge-based measures and introduced another method for measuring the similarity of short text snippets.

Although there is much previous work on improving the LDA model for short texts, little research has focused on the LDA model for medium and long texts. The main reason is that the semantic understanding of medium and long texts is complex, which makes them difficult to model. In addition, their diversified topic distributions bring difficulties to topic extraction and document representation. Existing LDA work on medium and long texts is mainly aimed at news topic discovery. Yu et al. [14] put forward a subtopic division method based on the proportion and distribution relations of the relevant subtopics. Nan et al. [15] divided reports into several topics according to time slices to find news topics. A few scholars have also attempted to classify and extract topics from medium and long texts. Lu et al. [16] proposed an information filtering method and a classification model combined with neural attention for long text classification. Wang et al. [17] proposed a two-phase automatic summarization method for long texts named TP-AS to improve accuracy. Wang et al. [18] introduced a novel topic-based model, called the topic hypergraph, that characterizes the thematic structure of a long document with a hypergraph representation. In short, little research has focused on improving LDA for medium and long texts, and much previous work on improving LDA ignores the semantic structure, which, strictly speaking, is a limitation.

The semantic unit is a concept in semantics used to capture the semantics of natural language. The semantics of any specific natural language sentence is called the sentence meaning, and a semantic unit is a unit of expressing meaning within the sentence meaning. It has five granularities: block, sentence, n-gram, paragraph and chapter. As part of the organizational structure of a text, the natural paragraph has specific semantic and pragmatic functions, and in most cases the text within a natural paragraph focuses on a single topic. Applying the paragraph as the semantic topic unit has achieved good results. The results obtained by Hearst et al. [19] using natural paragraphs are more accurate than those obtained by retrieving whole articles. Landauer et al. [20] represented paragraphs as vectors and obtained good results in capturing semantic information. Dai et al. [21] used paragraph vectors to embed documents and achieved high accuracy in experimental verification.

Fig. 1 The operation flow chart of LDAP

3 Improvement of LDA Model

The proposed model is an improvement of LDA. LDA, the abbreviation of latent Dirichlet allocation, was proposed by Blei et al. [7] in 2003. It is a generative probabilistic model of a corpus, also known as a three-layer Bayesian probability model, involving words, topics and documents. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. It regards a whole document as a distribution over topics and obtains the latent distributions by parameter estimation.

Like LDA, LDAP is a probabilistic model that describes a process for generating a document collection. Unlike LDA, LDAP introduces the concept of semantic topic units and models at the semantic topic unit level rather than at the document level. LDAP can be divided into three stages: dividing semantic topic units, topic modeling and subdocument merging. The operation flow chart of LDAP is shown in Fig. 1.

In the phase of dividing semantic topic units, a long text can be divided by blocks, sentences, n-grams, paragraphs or chapters. The granularity of partitioning affects the results. If n-grams or sentences are used as semantic topic units, the units are so short that the complexity of the algorithm suffers. Conversely, if a chapter is long enough, using it as a semantic topic unit easily reproduces the problems of the original LDA. Therefore, LDAP uses paragraphs as semantic topic units. After the semantic topic units are divided by paragraph, each paragraph of each document in the corpus forms a subdocument, and the subdocuments together constitute a new, larger corpus. The second stage applies LDA to this newly generated subdocument corpus to obtain the subdocument-topic distribution and the topic-word distribution. To obtain the topic distribution of each original document, the final stage merges the resulting subdocument-topic distributions. The method used here assigns each subdocument a weight representing its importance in the original document; the topic distribution of the original document is then obtained by combining the weights with the subdocument-topic distributions. The whole process of the LDAP algorithm can be described as follows:

figure a
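The three stages can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: `run_lda` is an assumed callback standing in for any standard LDA library that returns one K-dimensional topic distribution per (sub)document, paragraphs are assumed to be separated by blank lines, and the merge weights follow the length-proportion rule of Eq. (2).

```python
def split_into_units(document):
    """Stage 1: divide a document into semantic topic units (paragraphs)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def ldap(documents, run_lda):
    # Stage 1: every paragraph of every document becomes a subdocument.
    subdocs, spans = [], []
    for doc in documents:
        units = split_into_units(doc)
        start = len(subdocs)
        subdocs.extend(units)
        spans.append((start, start + len(units)))

    # Stage 2: plain LDA over the subdocument corpus
    # (one K-dimensional topic distribution per subdocument).
    theta_p = run_lda(subdocs)

    # Stage 3: merge each document's subdocument distributions,
    # weighting each paragraph by its share of the document length.
    theta_m = []
    for start, end in spans:
        lengths = [len(subdocs[i].split()) for i in range(start, end)]
        total = sum(lengths)
        K = len(theta_p[start])
        merged = [
            sum((lengths[i - start] / total) * theta_p[i][k]
                for i in range(start, end))
            for k in range(K)
        ]
        theta_m.append(merged)
    return theta_m
```

Since each weight vector sums to one and each `theta_p` row is a distribution, every merged `theta_m` row is again a valid topic distribution.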

The graphical models of LDA and LDAP are shown in Fig. 2.

Fig. 2 The graphical model representation

LDA is a three-layer Bayesian probability generative model; LDAP adds a semantic topic unit layer between the document layer and the topic layer of LDA. \(w_{p,n}\) is the observed variable, and \(\theta _m\), \(\theta _p\), \(\phi _k\) are the distributions to be estimated. According to the basic idea of LDAP, \(\theta _m\) and \(\theta _p\) are related because \(\theta _m\) is a weighted sum of the \(\theta _p\). We can therefore define a transition matrix R from the document-topic distribution \(\theta _m\) to the subdocument-topic distribution \(\theta _p\). If a paragraph is selected as the semantic topic unit, R can be calculated from the weight vector \(r = (r_1,r_2,\ldots ,r_p)^T\) of the subdocuments with respect to the original document:

$$\begin{aligned} R = r^+ \end{aligned}$$
(1)

where

$$\begin{aligned} r_i=\frac{length(p_i)}{length(m)} \end{aligned}$$
(2)

where length(i) is the length of text i, \(p_i\) is paragraph i, m is document m, and \(r^+\) is the generalized inverse of the matrix r. It is worth noting that r here reflects the semantic importance of each semantic topic unit (paragraph) in the entire document. Considering the complexity of the calculation, we assume that the number of valid terms contained in a paragraph reflects, to a certain extent, the semantic richness of that paragraph, and we therefore use the proportion of lengths to calculate the weights. The notation is summarized in Table 1.
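Eq. (2) can be read directly as code. In this sketch, length is measured in word tokens, which is one possible reading of "valid terms"; any length measure that reflects semantic richness could be substituted.

```python
def paragraph_weights(paragraphs):
    """Weight r_i of each paragraph: its share of the document's
    total length, per Eq. (2). The weights sum to 1."""
    lengths = [len(p.split()) for p in paragraphs]
    total = sum(lengths)
    return [l / total for l in lengths]
```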

Table 1 Notation used in the LDA and LDAP model

The generative process of a corpus by LDAP can be described as follows:

(1)   For each topic \(k\in [1,K]\):

      a. Draw a multinomial \(\phi _k\) from \(Dir(\beta )\).

(2)   For each document d in corpus D:

      a. For each subdocument p in document d:

            i. Draw a multinomial \(\theta _p\) from \(Dir(\alpha )\).

            ii. For each word position n in subdocument p:

                  \(\bullet \) Draw a topic \(z_{p,n}\) from \(\theta _p\).

                  \(\bullet \) Draw a word \(w_{p,n}\) from \(\phi _{z_{p,n}}\).

      b. Obtain \(\theta _m\) as the weighted sum of the \(\theta _p\) (Eq. 7).
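The generative story above can be sketched as a sampler (sampling only, no inference). This is an illustrative, pure-Python sketch assuming symmetric hyperparameters; the Dirichlet draw uses the standard normalized-Gamma construction.

```python
import random

def dirichlet(alphas):
    """Sample from a Dirichlet via normalized Gamma draws."""
    xs = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def generate_corpus(doc_subdoc_lengths, K, V, alpha=0.1, beta=0.01):
    """doc_subdoc_lengths[m][p] = number of words N_p in subdocument p
    of document m. Returns word ids sampled per the generative process."""
    phi = [dirichlet([beta] * V) for _ in range(K)]    # step (1): topic-word
    corpus = []
    for subdoc_lengths in doc_subdoc_lengths:          # step (2): documents
        doc = []
        for N_p in subdoc_lengths:                     # subdocuments
            theta_p = dirichlet([alpha] * K)           # theta_p ~ Dir(alpha)
            z = random.choices(range(K), weights=theta_p, k=N_p)
            w = [random.choices(range(V), weights=phi[k], k=1)[0] for k in z]
            doc.append(w)
        corpus.append(doc)
    return corpus
```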

Each document in the corpus is divided into subdocuments; the subdocuments then form a new corpus on which the topics are modeled. For any subdocument in the corpus, the generation probability is

$$\begin{aligned} p(z_p,w_p,\theta _p,\phi | \alpha ,\beta ) = \prod _n^{N_p} p(w_{p,n} | \phi _{z_{p,n}})p(z_{p,n}|\theta _p)p(\theta _p|\alpha )p(\phi |\beta ), \end{aligned}$$
(3)

and the probability of the corpus is

$$\begin{aligned} p(z_m,w_m,\theta _m,\phi | \alpha ,\beta ) = \prod _{m=1}^{M} \sum _{p=1}^{P_m} r_p \prod _{n}^{N_p} p(w_{p,n} | \phi _{z_{p,n}})p(z_{p,n}|\theta _p)p(\theta _p|\alpha )p(\phi |\beta ), \end{aligned}$$
(4)

where \(r_p\) is the weight of subdocument p with respect to the original document, as above. It can be seen that the main difference between LDAP and LDA is that LDAP applies LDA to the paragraphs of the documents and then weights each paragraph to obtain the topic distribution of the documents. The parameter estimation process is similar to that of LDA. Using Gibbs sampling for parameter estimation, the required parameters are calculated as follows:

$$\begin{aligned} \phi _{k,t} = \frac{n_k^{(t)}+\beta _t}{\sum _{t=1}^{V} (n_k^{(t)}+\beta _t)}, \end{aligned}$$
(5)
$$\begin{aligned} \theta _{p,k} = \frac{n_p^{(k)}+\alpha _k}{\sum _{k=1}^{K} (n_p^{(k)}+\alpha _k)}, \end{aligned}$$
(6)
$$\begin{aligned} \theta _{m,k} = \sum _{p=1}^{P_m} r_p \cdot \theta _{p,k} = \sum _{p=1}^{P_m} r_p \cdot \frac{n_p^{(k)}+\alpha _k}{\sum _{k=1}^{K} (n_p^{(k)}+\alpha _k)}, \end{aligned}$$
(7)

where \(r_p\) is the weight of paragraph p.
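Eqs. (5)-(7) can be written out directly. This sketch assumes the count matrices from a finished Gibbs run are given (topic-word counts \(n_k^{(t)}\) and subdocument-topic counts \(n_p^{(k)}\)) and that the hyperparameters are symmetric, i.e. \(\beta _t = \beta \) and \(\alpha _k = \alpha \) for all t, k.

```python
def phi_estimate(n_k_t, beta):
    """Eq. (5): smoothed topic-word distribution per topic."""
    return [[(c + beta) / (sum(row) + beta * len(row)) for c in row]
            for row in n_k_t]

def theta_p_estimate(n_p_k, alpha):
    """Eq. (6): smoothed subdocument-topic distribution."""
    return [[(c + alpha) / (sum(row) + alpha * len(row)) for c in row]
            for row in n_p_k]

def theta_m_estimate(theta_p, r):
    """Eq. (7): document-topic distribution as the r-weighted sum
    of its subdocuments' topic distributions."""
    K = len(theta_p[0])
    return [sum(r_p * tp[k] for r_p, tp in zip(r, theta_p))
            for k in range(K)]
```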

4 Experiments and Results

In this section, we first briefly introduce the datasets, the experimental setup and the evaluation methods used to assess the proposed method. Then we give the numerical results of modeling the two corpora with different methods. A comparison demonstrating the effectiveness of the proposed method is also presented.

4.1 Experimental Datasets and Setup

In the experiment, two corpora, collected and sorted by Fudan University and Sougou Lab, are used to verify the effectiveness of the proposed method. Both corpora consist of medium or long texts. Considering the complexity of the algorithm, we randomly selected three categories from the Fudan corpus and four categories from the Sougou corpus, taking 1000 documents from each category. According to the needs of the experiment, we took 4/5 of the total data set as the training set and the remaining 1/5 as the test set. The details of the experimental data are shown in Table 2.

LDAP can express a document as a distribution over topics to reflect its characteristics. The long documents in the two corpora and their concentrated topic distributions indicate that they have the characteristics of medium and long documents, so it is rational to model them with LDAP. To evaluate the effectiveness of the proposed method, we represent the same corpora with other document representation methods, namely LDA, HDP [22], LSA [23] and the deep-learning-based doc2vec [24], and use a random forest classifier. Moreover, to avoid documents and paragraphs that are too short to be split and would harm the topic modeling, the data in the corpora are preprocessed: we delete any paragraph containing fewer than five words.
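The preprocessing step described above amounts to a simple filter; in this sketch, "words" is taken to mean whitespace-separated tokens.

```python
def filter_paragraphs(paragraphs, min_words=5):
    """Drop paragraphs with fewer than min_words word tokens
    before building the subdocument corpus."""
    return [p for p in paragraphs if len(p.split()) >= min_words]
```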

Table 2 Description for experimental corpora

4.2 Evaluation Methods

In order to evaluate the effectiveness of document classification, we use the classical precision, recall and F1 measures on the testing corpus [25]. Precision is the ratio of the system's correct assignments to the total number of the system's assignments. Recall is the ratio of the system's correct assignments to the total number of correct assignments. In general, precision and recall are in tension: making one larger may make the other smaller. Therefore, a comprehensive measure of the two values is needed; the F1 measure is the harmonic mean of precision and recall. These values are calculated as follows:

$$\begin{aligned} P_i = \frac{A_i}{A_i+B_i}, R_i = \frac{A_i}{A_i+C_i}, F1_i = \frac{2*P_i*R_i}{P_i+R_i} \end{aligned}$$
(8)

where \(A_i\) represents the number of records for which the prediction is true and the truth is true, \(B_i\) the number for which the prediction is true and the truth is false, and \(C_i\) the number for which the prediction is false and the truth is true. We assume that there are C categories in the experimental corpus. The overall macro-averaged precision, recall and F1 measure values for the testing corpus, denoted \(Macro\_P\), \(Macro\_R\) and \(Macro\_F1\), respectively, can be calculated as follows:

$$\begin{aligned} Macro\_P = \sum _{i=1}^C \frac{P_i}{C}, Macro\_R = \sum _{i=1}^{C} \frac{R_i}{C}, Macro\_F1 = \sum _{i=1}^{C} \frac{F1_i}{C} \end{aligned}$$
(9)

In order to avoid errors caused by chance, we repeat each group of experiments 20 times and obtain the final \(Macro\_P\), \(Macro\_R\) and \(Macro\_F1\) by averaging over the 20 runs.
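Eqs. (8)-(9) can be spelled out as follows. This sketch takes per-class counts \((A_i, B_i, C_i)\) as defined above and assumes every denominator is nonzero (i.e., each class has at least one predicted and one true instance).

```python
def macro_prf(per_class_counts):
    """per_class_counts: list of (A_i, B_i, C_i) tuples, one per class.
    Returns (Macro_P, Macro_R, Macro_F1) as unweighted class averages."""
    Ps, Rs, F1s = [], [], []
    for A, B, C in per_class_counts:
        P = A / (A + B)               # Eq. (8): precision
        R = A / (A + C)               # Eq. (8): recall
        F1 = 2 * P * R / (P + R)      # Eq. (8): harmonic mean
        Ps.append(P); Rs.append(R); F1s.append(F1)
    n = len(per_class_counts)
    return sum(Ps) / n, sum(Rs) / n, sum(F1s) / n   # Eq. (9)
```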

Table 3 Comparison on macro-averaged precision rate (P) (%), recall rate (R) (%) and F1 rate (F1) (%) measure values between different documents representation methods of different number of features based on the corpus from Sougou Lab
Table 4 Comparison on macro-averaged precision rate (P) (%), recall rate (R) (%) and F1 rate (F1) (%) measure values between different documents representation methods of different number of features based on the corpus from Fudan University

4.3 Results and Discussion

The macro-averaged precision, recall and F1 measure values of the experiments on the two corpora are shown in Tables 3 and 4. Table 3 shows the experimental results on the corpus from Sougou Lab, and Table 4 those on the corpus from Fudan University. Seven numbers of features (20, 50, 100, 120, 150, 200 and 250) and five document representation methods (LDA, LDAP, HDP, LSA and doc2vec) are compared. The best result for each index under each number of features is bolded. The number of features can be regarded as the number of topics in the topic model.

Because this paper is an improvement on the LDA model, we compare LDA and LDAP more systematically, and the results are shown in Fig. 3. The experimental results show that, except in a few cases, the proposed LDAP model is more effective than the LDA model. In Table 3, all evaluation indices of the classification results based on the LDAP document representation are the best when the number of features is 50, 100, 150, 200 or 250. Comparing LDAP with doc2vec, LDAP performs worse than doc2vec in only a few cases. Except when the number of features is 20, the classification obtained with LDAP is generally better: the average of the evaluation indices is about 84%, versus about 81% with doc2vec. The results of the LSA model are unsatisfactory for most values of K and leave room for improvement. The performance of HDP is far below that of LDAP in all cases. According to Table 3 and Fig. 3, we can fully compare the results of LDA and LDAP. From Fig. 3a, only when the number of features is 20 or 120 is the P value of LDA slightly larger than that of LDAP. When the number of features is 120, the LDA model obtains its best classification result, with P, R and F1 of 85.22%, 81.38% and 81.39%. The average of the evaluation indices with LDAP is about 84%, while with LDA it is about 80%; that is, compared with LDA, the LDAP method improves the classification accuracy by about 4%. From Fig. 3c we can see that LDAP has a clear advantage over LDA on the Sougou corpus.

Fig. 3 Comparison of LDA and LDAP classification results on Sougou Corpus and Fudan Corpus. a The columnar graphs of P, R and F1 values corresponding to the two methods under different feature numbers on Sougou Corpus. b The columnar graphs of P, R and F1 values corresponding to the two methods under different feature numbers on Fudan Corpus. c A broken line graph of the corresponding F1 values of the two methods under different numbers of features on Sougou Corpus. d A broken line graph of the corresponding F1 values of the two methods under different numbers of features on Fudan Corpus

For the corpus of Fudan University, the LDAP method also shows its advantages. In Table 4, it is obvious that the classification results of LDAP are more effective than those of LDA and doc2vec in most cases. For the selected feature numbers, the deep-learning-based doc2vec method does not show good performance, with an average value of about 85%. By contrast, the average value of the LDAP algorithm is about 88%, and when the number of selected features is 250, LDAP reaches an evaluation value of nearly 93%. Although HDP achieves better results on the Fudan corpus than on the Sougou corpus, it is still not as good as LDAP. As for LSA, its classification performance is stable at a poor level. Compared with the Sougou corpus, the gap between LDA and LDAP on the Fudan University corpus is not particularly large, but it still exists: with LDA the average of the evaluation values is about 85-86%, while LDAP achieves about 88%. In summary, the LDAP algorithm performs better than LDA, HDP, LSA and doc2vec.

5 Conclusions

In this paper, we propose an improved LDA topic modeling method based on partition (LDAP), which is more suitable for topic modeling on medium and long texts. By dividing the document according to semantic topic units, the semantic information hidden in the text structure can be fully utilized: LDAP models at the level of semantic topic units instead of whole documents. The experimental results on the Sougou corpus and the Fudan corpus illustrate the better performance of the proposed method compared with LDA, HDP, LSA and the deep-learning-based doc2vec.

Although LDAP achieves better experimental results, it still has some limitations. Some medium and long texts lack the necessary text structure, in which case the results obtained by LDAP may not be ideal. Moreover, using paragraphs as semantic topic units is debatable. Finally, although LDAP adds a layer of subdocument-topic distribution to LDA, it does not fundamentally change the bag-of-words assumption, and some problems remain. In future research, we will improve the method of dividing semantic topic units and the weight measurement, expecting to achieve better modeling results.