1 Introduction

With the rapid development of Internet technology, we have entered the information age. Massive amounts of data are created and transferred to the virtual environment every day, most of which exist in the form of textual documents (Sebastiani 2002; Al-Mubaid and Umair 2006; Debole and Sebastiani 2003). Meanwhile, the demand for a better match between query terms and retrieved documents keeps rising (Li et al. 2011). Therefore, it is essential to classify these documents accurately according to their content. However, given the continuously growing volume of text data, handling such large amounts of information manually is inefficient and impractical (Labani et al. 2018). Fortunately, text classification (TC) technology, which has risen in recent years, can be used to accomplish this task (Shang et al. 2013; Tellez et al. 2018). TC aims at assigning category labels to documents according to their topics by means of specific classification algorithms (Shang et al. 2007; Zhang et al. 2011; Haddoud et al. 2016). This technology has been applied in many fields, including spam filtering (Li and Liu 2018), web page detection (Deng et al. 2020) and function extraction of patent texts (Li et al. 2017; Liu et al. 2020). However, computers cannot understand document texts as human beings do. Therefore, texts must be transformed into a format that computers and classifiers can process, and this transformation is called text representation (Lan et al. 2009). Currently, the vector space model (VSM) is a widely used text representation method in which documents are transformed into vectors weighted by specific measures (Salton et al. 1974; Zong et al. 2015). In this model, a document is represented as \(d_{k} = (t_{1}, t_{2}, \ldots, t_{n})\) with a corresponding weight vector \(w = (w_{1}, w_{2}, \ldots, w_{n})\), where n denotes the number of selected features, \(d_{k}\) denotes a specific document and w is the vector of term weights (Sabbah et al. 2017; Wu et al. 2017). In this process, term weighting is a vital step because the weight reflects the importance of a term and its contribution to distinguishing different kinds of texts (Debole and Sebastiani 2003; Lan et al. 2009; Guru et al. 2018).

In order to assign appropriate weights to terms, it is essential to choose a reasonable term weighting method. Generally, term weighting methods can be divided into two categories, supervised term weighting (STW) methods and unsupervised term weighting (UTW) methods, according to whether the predefined category information is utilized (Lan et al. 2009; Wang and Zhang 2013; Ren and Sohrab 2013; Chen et al. 2016). Because different term weighting methods adopt different models, even methods of the same UTW or STW type can differ considerably. For example, the TF method only takes term frequency into account when computing the term weight, under the assumption that the more frequently a term appears, the more important it is. In contrast, the IDF method is based on the assumption that the fewer documents a term appears in, the more significant it is, and it ignores term frequency completely (Zhang et al. 2011; Sabbah et al. 2016). Since TF and IDF each have defects as a single term weighting method, TF-IDF was proposed by combining them (Li et al. 2016), reflecting the idea that the term weight should be measured from both term frequency and document frequency (Spärck 2004; Salton and Buckley 1988). Note that the three term weighting methods mentioned above are all UTW methods, as the available category information is not utilized during training. From their core ideas, we can see that the term weights they produce only reflect the relationships between terms and documents and between documents. However, TC is a supervised learning task which aims at classifying texts into different categories according to their content (Lan et al. 2009), so UTW methods alone can hardly achieve satisfactory TC results.

To obtain a more reasonable term weighting method, it is natural to take category information into account, matching the supervised nature of the TC task (Altınçay and Erenel 2010; Tang et al. 2020). In contrast to UTW methods, methods that make use of the prior category information when computing term weights are called STW methods (Guru et al. 2018). Most STW methods follow the pattern of TF-IDF, which consists of a local part and a global part. Since term frequency (TF) is widely accepted as an excellent local weight, the TF part is retained while the IDF part is replaced in these newly proposed STW methods. For example, since Chi-square (CHI2), information gain (IG) and mutual information (MI) perform well in feature selection, they may also be suitable for measuring term weights. Replacing IDF with CHI2, IG and MI respectively yields three STW methods named TF-CHI2, TF-IG and TF-MI (Debole and Sebastiani 2003). Considering that the number of categories containing a specific term may carry useful information for the TC task, two STW methods named TF-ICF (Wang and Zhang 2013) and TF-IDF-ICF (Ren and Sohrab 2013) were proposed. Chen et al. (2016) argued that the traditional TF-IDF is not fully effective for the TC task and, based on a new statistical model, proposed a new STW method named TF-IGM and its variants, which are claimed to make full use of the fine-grained term distribution across different classes of texts. In addition, there are various STW methods based on other models, such as TF-OR (Altınçay and Erenel 2010) and TF-PB (Liu et al. 2009).

Intuitively, STW methods should outperform UTW methods in text classification because they make use of the predefined category information. In practice, however, TF-IDF, as a representative UTW method, performs better than some STW methods, which conflicts with this intuition (Lan et al. 2009; Quan et al. 2011). To investigate this problem, we analyzed the supervised term weighting methods listed above and found that some of them become invalid under special circumstances. For example, the category frequency (CF) part of TF-ICF and TF-IDF-ICF represents the total number of categories whose documents contain the chosen term. In these two models, the number of documents within a category that contain the chosen term has no effect on the term weight; that is, whether the term appears in one document or in ten documents of the same category makes no difference, which is obviously unreasonable. To eliminate this defect, TF-IDF-ICSDF (Ren and Sohrab 2013) was proposed based on a new model named inverse class-space-density frequency. However, it degenerates into TF-IDF when the number of documents in each category is the same (Chen et al. 2016). This is not an isolated case: a similar situation also occurs with TF-IGM, which becomes invalid when the class-wise document counts meet certain conditions. Hence, to offset this shortcoming, a novel method named TF-IGMimp (Dogan and Uysal 2019) was proposed by adding a ratio to the initial TF-IGM. Among these STW methods, term frequency-relevance frequency (TF-RF), proposed by Lan et al. (2009), is considered outstanding, with a reasonable theoretical explanation and good classification performance. More importantly, similar failures do not occur in TF-RF. Notably, most of the listed STW methods apply the logarithmic processing borrowed from TF-IDF to their respective global parts, but it is unclear whether this logarithmic processing benefits TC performance. To explore this problem, two improved methods named TF-ERF and ETF-RF are proposed by strengthening the RF part and the TF part, respectively. The experiments show that TF-ERF is more helpful to TC and has clear advantages over other term weighting methods.

The rest of this paper is arranged as follows. Section 2 points out problems existing in the related work. Section 3 proposes improved term weighting methods based on TF-RF. Section 4 introduces the experimental settings. Section 5 analyzes the experimental data in detail. Section 6 concludes this paper.

2 Analysis of current term weighting schemes

In this section, we analyze some existing term weighting schemes with good TC performance and point out their problems. For convenience, the notations used in this study are presented in Table 1, and seven representative term weighting methods are listed in Table 2. Note that the mathematical forms of these term weighting schemes are not normalized.

Table 1 Notations and descriptions
Table 2 Term weighting methods to be compared

2.1 Unsupervised term weighting method

As a widely used UTW method, TF-IDF shows good TC performance. It consists of a local part and a global part, named TF and IDF respectively. TF is based on the assumption that the more frequently a term appears in a document, the greater its contribution to representing this document; that is, the term is more important to this document (Lakshmi and Baskar 2019). This conclusion is intuitive, but is it really reasonable? Imagine a document in which a term t1 appears 20 times while another term t2 appears only once; can we conclude that t1 is 20 times more important than t2? Obviously not. Moreover, as mentioned above, the goal of TC is to classify documents into their corresponding categories according to their content. Obviously, terms that can distinguish between different types of documents ought to be assigned higher weights. However, the TF weight only reflects a term's ability to represent the document containing it (Zhang et al. 2019), which is at odds with the target of the TC task. The importance of a term is measured only in the term-document dimension, limited to a single text without considering the relationships among all texts, which results in poor classification performance. Therefore, TF alone cannot serve the TC task effectively as a term weighting method.

To make up for the defect of the TF part alone, the IDF part is introduced to balance the excessive influence of TF on the final term weight (Shang et al. 2013). For a specific term, its IDF weight is defined as

$$ {\text{IDF}}(t_{j} ) = \log \left( {\frac{N}{{df(t_{j} )}}} \right) $$
(1)

It can be seen from Eq. (1) that the smaller the value of \(df(t_{j})\), the larger the value of \({\text{IDF}}(t_{j})\), which reflects the assumption that the fewer documents a term appears in, the more significant it is. Combining the IDF part with the TF part yields the term weighting method TF-IDF. Intuitively, this model is more appropriate because it takes both the local and global parts into account, and its actual TC performance is consistent with this intuition. However, this method still has defects, which a simple example can illustrate. Assume that the training set contains four classes of texts, each consisting of 60 documents. Four terms with the same term frequency are selected, and their class-wise document frequencies are {60, 0, 0, 0}, {30, 30, 0, 0}, {20, 20, 20, 0} and {15, 15, 15, 15}, respectively. It can easily be seen that the ability to discriminate between classes is ranked as t1 > t2 > t3 > t4. However, the four terms are assigned the same IDF value, which is contrary to our intuition. Although the global factor is taken into account in TF-IDF, its incomplete use of global information occasionally leads to such extreme cases, which make the method invalid. This can be attributed to the absence of category information in the weight computation, so it is essential to measure the importance of a term in the document-category dimension.
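To make the example concrete, the following minimal sketch (plain Python with the illustrative counts from above) computes the IDF of the four hypothetical terms; since IDF only sees the total document frequency, all four receive the same value.

```python
import math

N = 240  # 4 classes x 60 training documents each, as in the example above

# Class-wise document frequencies of the four hypothetical terms t1..t4
class_df = {
    "t1": [60, 0, 0, 0],     # concentrated in a single class
    "t2": [30, 30, 0, 0],
    "t3": [20, 20, 20, 0],
    "t4": [15, 15, 15, 15],  # spread evenly over all classes
}

for term, dfs in class_df.items():
    df = sum(dfs)                # IDF only sees the total document frequency (60)
    idf = math.log(N / df)
    print(term, round(idf, 4))   # identical IDF for all four terms
```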

2.2 Supervised term weighting method

Since the TC task is a supervised learning process aimed at classifying different types of documents, it is natural to use STW methods to match it (Dogan and Uysal 2019). However, most existing STW methods have defects that make them invalid in some extreme situations. Here we take TF-IDF-ICSDF and TF-IGM as examples to illustrate the extreme situations that can lead to the failure of the term weighting methods. For TF-IDF-ICSDF, assume that the training set contains four classes of texts, each consisting of 100 documents. Four terms with the same term frequency are selected, and their class-wise document frequencies are {60, 0, 0, 0}, {50, 50, 0, 0}, {40, 40, 40, 0} and {20, 20, 20, 20}, respectively. It can easily be seen that the ability to discriminate between classes is ranked as t1 > t2 > t3 > t4. However, because the four categories contain the same number of documents, the ICSDF of each term (calculated by the formula in Table 2) coincides with its IDF and carries no additional class information, so the TF-IDF-ICSDF model degenerates into the TF-IDF model. Similarly, for TF-IGM, assume that the training set contains four classes of texts, each consisting of 100 documents. Four terms with the same term frequency are selected, with class-wise document frequencies of {90, 0, 0, 0}, {60, 0, 0, 0}, {30, 0, 0, 0} and {10, 0, 0, 0}, respectively. Intuitively, the class-distinguishing power should be ranked as t1 > t2 > t3 > t4, but according to the formula in Table 2, the standard IGM values of t1, t2, t3 and t4 are all equal to 1, implying that the four terms have the same distinguishing power, which is obviously unreasonable. As these two extreme cases show, the two STW methods impose certain requirements on the text collections and lack robustness across different types of texts.
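The TF-IGM failure can be checked numerically. The sketch below assumes the standard IGM form from Chen et al. (2016), f1 / Σr (r · fr) with the class-wise document frequencies sorted in descending order, which we take to be the formula referenced in Table 2; the counts are the illustrative ones from the example.

```python
def igm(class_dfs):
    """Standard inverse gravity moment: f1 / sum_r(r * f_r), with the class-wise
    document frequencies f_r sorted in descending order (Chen et al. 2016)."""
    f = sorted(class_dfs, reverse=True)
    return f[0] / sum(r * fr for r, fr in enumerate(f, start=1))

# Four terms occurring only in the first class, with very different frequencies
for dfs in ([90, 0, 0, 0], [60, 0, 0, 0], [30, 0, 0, 0], [10, 0, 0, 0]):
    print(dfs, igm(dfs))   # every term gets IGM = 1.0 despite the differences
```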

2.3 Relevance frequency

Apparently, the above-mentioned term weighting methods treat the TC task as a multi-label classification problem, but in fact, what we need to do is simply separate the chosen category from the others rather than taking every unrelated category into account. Following this idea, TF-RF was proposed, which simplifies the multi-label classification problem into multiple independent binary classification problems (Lan et al. 2009). Specifically, when computing the weight of a term in the training corpus, the category of the document containing this term is tagged as the positive category, and all other categories are treated jointly as the negative category (Lan et al. 2009). TF-RF is based on the idea that the more concentrated a term is in the positive category relative to the negative category, the greater its ability to select the correct category for the documents containing it. Besides, TF-RF also assumes that the importance of a term is only related to the documents containing it (Lan et al. 2009). Based on these two ideas, the term weighting method TF-RF takes the form shown in Table 2. Note that logarithmic processing is also applied to the global part of TF-RF.

Since logarithmic processing places certain restrictions on its arguments, the parameters must be limited to a certain range to prevent failure. For TF-RF, first, to prevent the RF part from being meaningless when the chosen term does not appear in any positive document, i.e., a = 0, the minimum of the argument is set to 2. Meanwhile, the base is also set to 2 to match the argument. With this processing, the RF value becomes 1 when a = 0, so that the term weight depends entirely on the TF weight in this case, which is logical. Second, the minimum of the denominator is set to 1 to prevent the RF value from becoming infinite. It seems that the processed formula of TF-RF shown in Table 2 solves these problems. However, because the TF part depends on how often a term appears in a document, it is inevitably affected by document length. Therefore, normalization is performed to eliminate the influence of different text lengths when computing the term weight. The normalized TF-RF formula is defined as

$$ {\text{TFRF}}(t_{j} ,d_{k} ,c_{i} ) = \frac{{tf(t_{j} ,d_{k} ) \cdot \log_{2} (a/\max (1,c) + 2)}}{{\sqrt {\sum\limits_{j = 1}^{n} {\left( {tf(t_{j} ,d_{k} ) \cdot \log_{2} (a/\max (1,c) + 2)} \right)^{2} } } }} $$
(2)
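As a reference for the improvements discussed below, the following minimal sketch computes the cosine-normalized TF-RF weights of Eq. (2) for one toy document; the counts a (positive-category documents containing the term) and c (negative-category documents containing the term) follow our reading of the notation in Table 1, and all values are illustrative.

```python
import math

def rf(a, c):
    """Relevance frequency: a = positive-category documents containing the term,
    c = negative-category documents containing the term (Lan et al. 2009)."""
    return math.log2(2 + a / max(1, c))

def tf_rf_document(tfs, a_counts, c_counts):
    """Cosine-normalised TF-RF weights for one document, as in Eq. (2)."""
    raw = [tf * rf(a, c) for tf, a, c in zip(tfs, a_counts, c_counts)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw

# Toy document with three terms: raw term frequencies and their a / c statistics
print(tf_rf_document(tfs=[3, 1, 2], a_counts=[40, 5, 20], c_counts=[2, 50, 20]))
```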

As analyzed above, IDF was introduced to balance the excessive influence of TF alone on the term weight. For the same purpose, the global part RF is introduced to produce a more reasonable term weighting model, and the logarithmic processing borrowed from IDF is applied directly to the RF part. Although TF-RF shows outstanding performance in the TC task, it is unclear whether the logarithmic processing applied to the RF part contributes to this good performance; it may even prevent TF-RF from achieving better TC performance. Therefore, a question naturally arises: does the logarithmic processing used for IDF also suit RF? This question is analyzed in the next section.

3 Proposed method

In this section, to address the question raised in the previous section, two assumptions are proposed along with corresponding strategies. In addition, the plausibility of the two assumptions is discussed at the end of the section.

3.1 Motivation

Intuitively, if two terms are assigned the same weight, one might conclude that they are equally important. But is the actual situation really that simple? As mentioned before, most term weighting methods consist of a local part and a global part, and the term weight is the product of the two. The term weight thus seems to take both parts into account comprehensively. From another perspective, however, the respective characteristics of the two parts cannot be fully expressed, because the relative magnitudes of the local weight and the global weight are lost when they are multiplied. An example illustrates this. Suppose there are two terms t1 and t2, with TF weights of 1 and 100 and RF weights of 100 and 1, respectively; their term weights are obviously the same. However, the two terms differ greatly: the RF weight dominates the term weight for t1, whereas the opposite holds for t2, so the influence of the weaker part on the term weight is suppressed by the stronger part.

The TF-RF model under study performs relatively well in the TC task, but it is unknown whether the logarithmic processing borrowed from TF-IDF and applied to the global part RF contributes to this performance. Moreover, the logarithmic processing may even unbalance the influence of the TF and RF parts on the term weight, so that the characteristic of one part is concealed by the other, and TF-RF cannot weight terms appropriately from both parts. Based on this problem, two assumptions can be proposed as follows.

Assumption 1

The RF part is weakened too much by the logarithmic processing. In other words, the impact of the TF part is overemphasized, so that the RF part's contribution to the discriminating power of the selected term is suppressed by the TF part.

Assumption 2

The logarithmic processing alone does not weaken the RF part enough. The RF part still dominates the term weight excessively, suppressing the TF part's contribution to the discriminating power of the selected term.

The two assumptions are illustrated in Fig. 1. As the figure shows, terms with different characteristics may be assigned the same weight by the initial TF-RF model; among these terms, some are dominated by tf and others by rf. Under Assumption 1, the TF part dominates the term weight and restrains the expression of the RF part, so the RF part should be strengthened to restore the balance and make the method more reasonable. Under Assumption 2, the situation is the opposite, so the TF part should be strengthened to balance the excessive influence of the RF part; naturally, the resulting method is also the opposite of that under Assumption 1. The corresponding strategies for the two assumptions are analyzed in detail in the next section.

Fig. 1
figure 1

Two assumptions based on TF-RF

3.2 Improvement approaches

In view of the two assumptions above, to obtain a more reasonable term weighting method, the part whose influence on the term weight is suppressed by the other part ought to be strengthened. Two strategies, shown in Fig. 2, can be used for this purpose.

Fig. 2
figure 2

Framework of the proposed method

Strategy 1: Multiplying the chosen part by a coefficient k (k·tf or k·rf, k > 1).

Strategy 2: Raising the chosen part to a power k (tf^k or rf^k, k > 1).

For Assumption 1, applying Strategy 1 to the normalized TF-RF in Eq. (2) yields Eq. (3), in which k = krf.

$$ \begin{aligned} {\text{TFRF}}(t_{j} ,d_{k} ,c_{i} )* & = \frac{{tf(t_{j} ,d_{k} ) \cdot k_{rf} \cdot \log_{2} (a/\max (1,c) + 2)}}{{\sqrt {\sum\limits_{j = 1}^{n} {\left( {tf(t_{j} ,d_{k} ) \cdot k_{rf} \cdot \log_{2} (a/\max (1,c) + 2)} \right)^{2} } } }} \\ \, & = \frac{{tf(t_{j} ,d_{k} ) \cdot \log_{2} (a/\max (1,c) + 2)}}{{\sqrt {\sum\limits_{j = 1}^{n} {\left( {tf(t_{j} ,d_{k} ) \cdot \log_{2} (a/\max (1,c) + 2)} \right)^{2} } } }} = {\text{TFRF}}(t_{j} ,d_{k} ,c_{i} ) \\ \end{aligned} $$
(3)
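This cancellation can be verified with a quick numeric check (illustrative values only): a constant multiplier applied to either factor disappears under cosine normalization.

```python
import math

def normalise(raw):
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

raw = [0.5, 2.0, 1.0]              # tf * rf products of one document's terms
scaled = [5 * w for w in raw]      # Strategy 1: multiply the rf factor by k = 5
print(normalise(raw))
print(normalise(scaled))           # identical output: the constant k cancels
```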

Equation (3) shows that the coefficient krf applied to the RF part is cancelled out by the normalization, so Strategy 1 is meaningless. Applying Strategy 2 to Eq. (2) instead gives Eq. (4); obviously, this cancellation does not occur under Strategy 2. Therefore, Strategy 2 is chosen as the feasible way to strengthen the RF part, and the improved method in Eq. (4) is named term frequency-exponential relevance frequency (TF-ERF).

$$ {\text{TFERF}}(t_{j} ,d_{k} ,c_{i} ) = \frac{{tf(t_{j} ,d_{k} ) \cdot \left( {\log_{2} (a/\max (1,c) + 2)} \right)^{{k_{rf} }} }}{{\sqrt {\sum\limits_{j = 1}^{n} {\left( {tf(t_{j} ,d_{k} ) \cdot \left( {\log_{2} (a/\max (1,c) + 2)} \right)^{{k_{rf} }} } \right)^{2} } } }} $$
(4)

For Assumption 2, the same conclusion holds as for Assumption 1: Strategy 1 is also invalid because the coefficient ktf applied to the TF part is cancelled out. Therefore, Strategy 2 is selected to strengthen the TF part. Applying Strategy 2 to Eq. (2) gives Eq. (5), named exponential term frequency-relevance frequency (ETF-RF), in which k = ktf.

$$ {\text{ETFRF}}(t_{j} ,d_{k} ,c_{i} ) = \frac{{tf(t_{j} ,d_{k} )^{{k_{tf} }} \cdot \log_{2} (a/\max (1,c) + 2)}}{{\sqrt {\sum\limits_{j = 1}^{n} {\left( {tf(t_{j} ,d_{k} )^{{k_{tf} }} \cdot \log_{2} (a/\max (1,c) + 2)} \right)^{2} } } }} $$
(5)
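The two improved weightings can be sketched directly from Eqs. (4) and (5). The helper functions and toy counts below are illustrative; only the placement of the exponent differs between the two methods.

```python
import math

def rf(a, c):
    return math.log2(2 + a / max(1, c))

def normalise(raw):
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw

def tf_erf(tfs, a_counts, c_counts, k_rf=2.0):
    """Eq. (4): the RF factor is raised to the power k_rf (k_rf > 1)."""
    return normalise([tf * rf(a, c) ** k_rf
                      for tf, a, c in zip(tfs, a_counts, c_counts)])

def etf_rf(tfs, a_counts, c_counts, k_tf=2.0):
    """Eq. (5): the TF factor is raised to the power k_tf (k_tf > 1)."""
    return normalise([tf ** k_tf * rf(a, c)
                      for tf, a, c in zip(tfs, a_counts, c_counts)])

tfs, a_counts, c_counts = [3, 1, 2], [40, 5, 20], [2, 50, 20]
print(tf_erf(tfs, a_counts, c_counts, k_rf=2))   # strengthens the RF part
print(etf_rf(tfs, a_counts, c_counts, k_tf=2))   # strengthens the TF part
```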

3.3 Qualitative analysis of two improved methods

The actual effects of the methods proposed in the previous section are presented in Fig. 3. Figure 3a, b present the resulting weights for different values of tf and rf when an exponent krf (krf = 1, 2, 3) or ktf (ktf = 1, 2, 3) is applied to the RF or TF part, respectively. It can be seen that the surface in Fig. 3b is steeper than that in Fig. 3a for the same ktf and krf. In addition, in Fig. 3a the partial derivatives in the tf and rf directions differ little even though krf is introduced to strengthen the RF part, whereas in Fig. 3b the partial derivative in the tf direction far exceeds that in the rf direction owing to the introduction of ktf. This indicates that introducing ktf influences the term weight more strongly than introducing krf. Since the initial TF-RF is an excellent term weighting scheme with good TC performance, the respective influences of the TF and RF parts on the term weight are already reasonable to a certain extent. Therefore, strengthening the influence of the TF part by adding the exponent ktf may greatly undermine this balance and reduce classification performance. A similar analysis and conclusion are also given in Lan et al. (2009), consistent with ours. Hence, if the initial TF-RF can be improved at all, it is most likely by enhancing the dominance of the RF part on the term weight; in other words, TF-ERF is expected to improve on TF-RF more than ETF-RF does.

Fig. 3
figure 3

Performance adding different ktf and krf to TF part and RF part separately

4 Experimental setup

In this section, the experimental datasets, the feature selection method, the classification algorithms and the performance evaluation measures used in our experiments are introduced in turn.

4.1 Experimental datasets and pre-processing

To verify whether the improved methods proposed above help to improve TC performance, a series of experiments is conducted on two datasets, the Reuters-21578 corpus and the WebKB corpus.

4.1.1 Reuters-21578 corpus

The first dataset used in our experiments is the Reuters-21578 corpus, which is widely used in text classification. This English corpus contains 90 classes of news documents, of which the 8 largest classes are selected for the classification task. The reduced corpus contains 7674 documents, divided into a training set of 5485 documents and a test set of 2189 documents (Table 3).

Table 3 Data description on Reuters-21578

Pre-processing is an important step which affects the text classification results to a certain degree. In this step, punctuation marks, numbers and other symbols are removed. Furthermore, to reduce the size of the feature set, terms that appear fewer than two times are discarded. Finally, all letters are converted to lowercase and words are stemmed with Porter's stemmer (Porter 2006).

The remaining 8541 distinct terms build the feature set. After pre-processing, the document-term matrices of the training and test sets are 5485 × 8541 and 2189 × 8541, respectively.
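A minimal sketch of this pre-processing pipeline is given below, assuming NLTK's Porter stemmer and scikit-learn's CountVectorizer (the paper does not name its tooling, so both are assumptions); the toy corpus stands in for the Reuters documents, and min_df=2 only approximates the rule of discarding terms seen fewer than two times.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()                      # convert all letters to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation, numbers, symbols
    return " ".join(stemmer.stem(w) for w in text.split())

# Toy corpus standing in for the Reuters training and test documents
train_texts = ["Oil prices rose 3% on Monday.",
               "Crude oil exports increased sharply.",
               "Wheat and corn harvests fell this year."]
test_texts = ["Corn prices fell as oil exports rose."]

# min_df=2 approximates dropping terms that appear fewer than two times
vectorizer = CountVectorizer(preprocessor=preprocess, min_df=2)
X_train = vectorizer.fit_transform(train_texts)   # document-term matrix (training)
X_test = vectorizer.transform(test_texts)         # document-term matrix (test)
print(vectorizer.get_feature_names_out(), X_train.shape, X_test.shape)
```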

4.1.2 WebKB corpus

The second dataset used in our experiments is the WebKB corpus. This English corpus contains 4199 documents, divided into a training set of 2803 documents and a test set of 1396 documents (Table 4). The same pre-processing described above is applied to the WebKB corpus. Eventually, 7061 distinct terms (features) build the feature set, and the document-term matrices of the training and test sets are 2803 × 7061 and 1396 × 7061, respectively.

Table 4 Data description on WebKB

4.2 Feature selection

The initial feature set is obtained after pre-processing the datasets. However, this feature set cannot be used directly in the experiments because many features are meaningless for the TC task. In addition, such uninformative features may harm the classifier and degrade TC performance (Meng et al. 2011; Wang et al. 2015). Therefore, it is essential to select the most effective features without sacrificing TC performance. Feature selection (FS) aims at building a more reasonable model for the TC task by selecting relatively valuable terms. Many methods can be applied to feature selection; for example, the three typical methods CHI2, IG and MI were mentioned in Sect. 1. Among them, CHI2 is recognized as the most effective owing to its outstanding performance (Liu and Yu 2005; Taşcı and Güngör 2013; Şahin and Kılıç 2019), so it is adopted for feature selection in this study.
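A minimal sketch of CHI2 feature selection is shown below, assuming scikit-learn's SelectKBest with the chi2 score function (an assumption about tooling; the paper only names the CHI2 criterion); the tiny count matrix and labels are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Tiny synthetic document-term count matrix (6 documents x 5 terms) and labels
X_train = np.array([[3, 0, 1, 0, 2],
                    [2, 0, 0, 1, 3],
                    [4, 1, 0, 0, 1],
                    [0, 3, 2, 4, 0],
                    [0, 2, 3, 5, 1],
                    [1, 4, 2, 3, 0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

selector = SelectKBest(score_func=chi2, k=3)       # keep the 3 highest-scoring terms
X_train_selected = selector.fit_transform(X_train, y_train)
print(selector.get_support(indices=True), X_train_selected.shape)
```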

4.3 Classification algorithms

Bayes classifiers are a family of classification algorithms based on Bayes' theorem. Among them, Naive Bayes (NB) is the most common and simplest one, and it is widely used in TC (Yang and Pedersen 1997). NB treats all features as independent without interaction and regards the TC task as a probability problem in which the class with the highest posterior probability is selected as the final category (Friedman et al. 1997; Ning et al. 2021). Under this assumption, the number of parameters to be estimated is greatly reduced, which simplifies the requirements on the feature space and the computation. As a result, the NB classifier is simple and efficient. In view of these advantages, the NB classifier is used in our experiments, and its formulation is presented in Eq. (6).

Assume that document dk consists of a number of terms and can be represented as dk = (t1, t2, …, tn). The probability that document dk belongs to category ci is defined as Eq. (6), in which P(dk) is the same for all categories and acts as a normalizing constant. More details can be found in Farid et al. (2014) and Ilinskas and Litvinas (2020). The default parameter settings of the NB classifier are used in this research.

$$ P(c_{i} {\text{|d}}_{k} ) = \frac{{P(c_{i} )\prod\limits_{j = 1}^{n} {P(t_{jk} {\text{|c}}_{i} )} }}{{P(d_{k} )}} $$
(6)
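A minimal sketch of the classification step is given below, using scikit-learn's MultinomialNB with its default parameters as stated above (the library choice is an assumption); the weighted matrices and labels are illustrative placeholders for the term-weighted document vectors.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative term-weighted document vectors (e.g. normalized TF-ERF weights)
X_train_w = np.array([[0.9, 0.1, 0.0],
                      [0.8, 0.2, 0.1],
                      [0.0, 0.3, 0.9],
                      [0.1, 0.2, 0.8]])
y_train = np.array([0, 0, 1, 1])
X_test_w = np.array([[0.7, 0.2, 0.2]])

clf = MultinomialNB()            # default parameters, as used in this research
clf.fit(X_train_w, y_train)
print(clf.predict(X_test_w))     # predicted category for the test document
```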

4.4 Evaluation of the performance

To verify the effectiveness of the improved methods, TC performance needs to be evaluated with standard measures. Among the many evaluation indexes, precision and recall are two popular measures for the TC task. Precision denotes the proportion of correct assignments among all the test documents assigned to the target category, and recall denotes the proportion of correct assignments among all the test documents that should belong to the target category. However, neither can be used alone to evaluate performance, because a higher value of one indicator may be obtained at the expense of the other. Therefore, the F1 measure was proposed, which combines precision and recall with equal importance. Precision, recall and the F1 measure are defined as follows, and the corresponding notations are explained in Table 5.

$$ P(c_{i} ) = \frac{{{\text{TP}}(c_{i} )}}{{{\text{TP}}(c_{i} ) + {\text{FP}}(c_{i} )}} $$
(7)
$$ R(c_{i} ) = \frac{{{\text{TP}}(c_{i} )}}{{{\text{TP}}(c_{i} ) + {\text{FN}}(c_{i} )}} $$
(8)
$$ F(c_{i} ) = \frac{{2 \cdot P(c_{i} ) \cdot R(c_{i} )}}{{P(c_{i} ) + R(c_{i} )}} = \frac{{2 \cdot {\text{TP}}(c_{i} )}}{{2 \cdot {\text{TP}}(c_{i} ) + {\text{FP}}(c_{i} ) + {\text{FN}}(c_{i} )}} $$
(9)
Table 5 Contingency table for category ci

In general, the F1 measure is computed in two ways, micro-averaged F1 (micro-F1) and macro-averaged F1 (macro-F1), defined as Eqs. (10) and (11).

$$ {\text{macro}{-}}F1 = \frac{1}{m}\sum\limits_{i = 1}^{m} {F_{1} (c_{i} )} $$
(10)
$$ {\text{micro}}{-}F1 = \frac{{2 \cdot \sum\nolimits_{i = 1}^{m} {{\text{TP}}(c_{i} )} }}{{2 \cdot \sum\nolimits_{i = 1}^{m} {{\text{TP}}(c_{i} )} + \sum\nolimits_{i = 1}^{m} {{\text{FP}}(c_{i} )} + \sum\nolimits_{i = 1}^{m} {{\text{FN}}(c_{i} )} }} $$
(11)
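The two averaged measures of Eqs. (10) and (11) can be computed, for instance, with scikit-learn's f1_score (an assumption about tooling); the label vectors below are illustrative.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 3]   # gold category labels of the test documents
y_pred = [0, 1, 1, 1, 2, 2, 3, 3]   # labels predicted by the classifier

print("macro-F1:", f1_score(y_true, y_pred, average="macro"))   # Eq. (10)
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))   # Eq. (11)
```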

5 Experiment results and analysis

In this section, the orthogonal experimental design is first described, through which one of ETF-RF and TF-ERF is shown to be more helpful for improving the initial TF-RF. Then, the chosen method is compared with the other term weighting methods listed in Table 2 to verify its effectiveness in improving TC performance.

5.1 Performance comparisons between TF-ERF and ETF-RF

The first group of experiments determines which of TF-ERF and ETF-RF is more helpful for improving the initial TF-RF. Orthogonal experimental design is a method for studying multiple factors at multiple levels. Its main tool is the orthogonal table: according to the number of factors, the number of levels and whether interactions are considered, an appropriate orthogonal table is chosen, and representative combinations are selected from the full factorial test based on the orthogonality of the table. In this study, the orthogonal table L25(5^6) is selected to arrange the parameter combinations at different levels. Experiments and analysis with different parameter values show that when ktf and krf take values in {1, 2, 3, 4, 5}, the proposed weighting models achieve good classification results on both test text sets, so this set is chosen as the parameter set. Within this parameter set, the proposed term weighting models show good classification performance and robustness across different text sets.
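Because only the two factors ktf and krf are tuned here, each at five levels, the 25 runs arranged by the L25 table correspond to the full 5 × 5 grid. The sketch below only enumerates these combinations; the evaluation step is indicated as a comment, since how each combination is scored depends on the weighting and classification pipeline described above and is our reading of the experiment.

```python
import itertools

levels = [1, 2, 3, 4, 5]                      # candidate values for k_tf and k_rf
for k_tf, k_rf in itertools.product(levels, levels):
    # For each combination: weight the corpus using the exponents k_tf (TF part)
    # and k_rf (RF part), train the NB classifier, and record macro-/micro-F1.
    print(k_tf, k_rf)
```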

Figure 4 presents the performance obtained on the Reuters-21578 and WebKB datasets separately. It can be seen that the classification performance deteriorates rapidly as ktf increases. In contrast, changes in krf do not influence TC performance as strongly as ktf does, and increasing krf has a positive impact on classification performance. This indicates that the improved method TF-ERF is beneficial to improving the initial TF-RF model, which is consistent with the analysis in Sect. 3.3.

Taking both macro-F1 and micro-F1 into account in Fig. 4, the best classification performance is observed at ktf = 1 and krf = 5 for the Reuters-21578 dataset and at ktf = 1 and krf = 2 for the WebKB dataset. The parameters of TF-ERF are thus determined, and TF-ERF is compared with other term weighting schemes in the following experiments.

Fig. 4
figure 4

Classification performance according to different parameters ktf and krf under two corpora

5.2 Performance comparisons of existing methods

The second group of experiments verifies the effectiveness of TF-ERF by comparing its performance with the other term weighting schemes listed in Table 2. The classification experiments are carried out on the two text sets introduced before, the Reuters-21578 corpus and WebKB, which retain 8000 and 7000 features respectively after feature selection. To reflect the classification performance of each term weighting model at different numbers of features succinctly and intuitively, the analysis divides the total features of the two text sets with a step size of 1000.

Figure 5 shows the TC performance on the Reuters-21578 corpus. It can be seen that TF-IDF and TF-IDF-ICSDF give the worst performance over all feature set sizes for both macro-F1 and micro-F1. Especially with small feature sets (fewer than 200 features), these two schemes perform far worse than the others, while the remaining schemes perform well even with few features.

Fig. 5
figure 5

Macro-F1 and Micro-F1 measure of the eight weighting approaches on the Reuters-21578 corpus with different numbers of features

In terms of macro-F1, almost all term weighting schemes reach their peaks at around 1000 features, and TF-ERF attains the best peak performance. On the whole, TF-ERF is superior to the other schemes when the number of features is less than 7000. In addition, TF-RF shows no advantage over the other schemes and is even inferior to TF-CHI2 and TF-MI for most feature set sizes. By contrast, TF-ERF, as an improved version of TF-RF, is clearly effective in improving TC performance.

In terms of micro-F1, the schemes do not peak at the same feature set sizes as for macro-F1, but there are still turning points at around 1000 features. After that, the micro-F1 of some schemes, such as TF-IDF and TF-IDF-ICSDF, begins to decrease, while the rest continue to increase at a slower rate. TF-ERF also shows a clear advantage over the other schemes for most feature set sizes, except when the number of features is 800. TF-ERF peaks at around 5000 features, and this peak is the best performance of all term weighting schemes over all feature set sizes.

Figure 6 shows the TC performance on the WebKB corpus. Unlike in Fig. 5, most schemes perform poorly when the number of features is relatively small. In addition, TF-ERF does not show as great an advantage over the other schemes as it does in Fig. 5. Nevertheless, TF-ERF still performs better over a certain range of feature set sizes and improves on the initial TF-RF.

Fig. 6
figure 6

Macro-F1 and Micro-F1 measure of the eight weighting approaches on the WebKB corpus with different numbers of features

In terms of macro-F1, TF-CHI2, TF-MI and TF-OR keep growing almost monotonically as the number of features increases, while the other schemes start to decline after reaching their respective peaks. TF-ERF is superior to the other schemes when the number of features falls in [200, 3000]. Beyond 3000 features, its performance begins to deteriorate and is surpassed by TF-MI, TF-OR and TF-CHI2 in turn. Nevertheless, TF-ERF peaks at around 2000 features, and this point is the best performance among all the schemes. In addition, as an improved version of the initial TF-RF, it outperforms TF-RF over the whole range of feature set sizes.

In terms of micro-F1, the curves are much tighter than in the macro-F1 plot, meaning that the performance gaps between the schemes are smaller. TF-IGM, TF-RF and TF-ERF perform better than the remaining schemes over almost the whole range of feature set sizes. The three schemes are nearly indistinguishable when the number of features is less than 1500; after that, all three decline one after another. Because the decline of TF-ERF is smaller than that of the other two, its performance remains superior to TF-RF and TF-IGM. TF-ERF peaks at around 3000 features, which is also the best performance of all eight schemes over all feature set sizes.

From the above two groups of experiments, it can be seen that the proposed term weighting method TF-ERF achieves better TC performance than the other methods at most feature set sizes. Although TF-ERF is surpassed by some methods at specific feature set sizes, it still has obvious advantages on the whole. Meanwhile, the best TC performances are always obtained by TF-ERF, indicating that its TC ability has a higher upper limit. Furthermore, all term weighting methods perform poorly when very few features are used, which shows that most useful features have been filtered out and the remaining ones are insufficient for good classification. Different term weighting methods show different trends as the number of features increases, and these trends are also influenced by the chosen text set. For example, in the experiments on the Reuters-21578 corpus, most term weighting methods achieve their best performance at around 1000 features and then show an obvious downward trend. For the WebKB dataset, most methods peak at around 3000 features and then show a weak downward trend, with exceptions such as TF-CHI2, TF-MI and TF-IDF-ICSDF, whose performance keeps rising as the number of features increases. As for the improved method TF-ERF, its performance trend across different numbers of features is less affected by the choice of text set than that of the other methods, and the best classification performances are all achieved by TF-ERF, which demonstrates its good text classification ability and robustness.

5.3 Performance improvements of proposed method over others

To compare the proposed scheme with the existing schemes more precisely, the detailed experimental results of each scheme on the Reuters-21578 and WebKB datasets are listed in Tables 6 and 7, respectively. The values in parentheses are the improvements of TF-ERF relative to the other schemes. The best result for each feature set size is shown in boldface, and the best result over all feature set sizes is shown in bold and underlined.

Table 6 Macro-F1 and micro-F1 for Reuters-21578 dataset
Table 7 Macro-F1 and micro-F1 for WebKB dataset

Table 6 shows that most of the per-feature-set best results are achieved by TF-ERF. For both macro-F1 and micro-F1, the best result of TF-ERF is also the best among all eight term weighting methods over all feature set sizes. As an improved version of TF-RF, TF-ERF outperforms TF-RF at every feature set size. On the whole, the improvement over TF-RF grows with the number of features, except at a few points. In Table 7, although TF-ERF does not show as strong an advantage as in Table 6, it is still superior to the other methods in most cases. Moreover, the best result of TF-ERF still represents the optimal level among all the selected term weighting methods over all feature set sizes. In summary, the improved term weighting method TF-ERF shows better text classification performance than the other term weighting methods and better robustness across text sets with different characteristics.

6 Conclusion

In this study, considering that the logarithmic processing of the global weight borrowed from TF-IDF may not suit other term weighting methods such as TF-RF, two improved term weighting methods based on TF-RF were proposed to explore this problem. First, two assumptions were put forward to explain the problem faced by TF-RF. For each assumption, a feasible improvement strategy was identified and applied, yielding the two methods TF-ERF and ETF-RF. Then, through orthogonal experimental design, TF-ERF, which corresponds to the assumption that the local TF part of TF-RF suppresses the contribution of the global RF part to the discriminating power of the selected term, was shown to be more helpful to TC than ETF-RF, and its parameters were determined. Finally, TF-ERF was compared with seven existing representative term weighting methods, and the experimental results showed that TF-ERF has better TC ability than all the listed term weighting methods, including TF-RF.