
1 Introduction

Text mining, also known as text data mining or knowledge discovery in text (KDT), is the process of extracting high-quality information from raw data or text. High quality means the information matches the user's need. Text classification is an active research field of text mining. As computers treat text as a sequence of strings, they cannot extract useful information from it directly. So, specific algorithms and techniques should be used to preprocess the raw data in order to obtain useful information or patterns [1].

Text (document) classification is an active research area of text mining in which text documents are assigned to classes or categories [2, 3]. These text documents include letters, newspapers, articles, blogs, proceedings, journal papers, etc. Text categorization means dividing a set of input documents into the two or more classes to which these documents belong. Because of the large increase in the availability of data in digital form, it has become necessary to organize it.

Text classification techniques can be divided into two categories: supervised document classification and unsupervised document classification (or document clustering). In supervised classification, an external mechanism (e.g., human feedback) provides the information for defining the classes and classifying the documents. Supervised machine learning techniques like Support Vector Machines, k-Nearest Neighbors, Naive Bayes, and Decision Trees are applied frequently in text classification [1].

In unsupervised classification, the system does not have any pre-defined classes and works without any external reference. Classification can also be semi-supervised, in which some documents are pre-classified by external means for better learning of the classifier. k-means, hierarchical clustering, etc., are commonly used as unsupervised learning techniques in text classification.

Text classification is divided into two phases: a training phase and a testing phase, as shown in Fig. 1. A set of pre-classified or labeled documents \(D = \left\{ {d_{1} ,d_{2} ,d_{3} , \ldots ,d_{n} } \right\}\), the training set, belongs to a set of classes \(C = \left\{ {c_{1} ,c_{2} ,c_{3} , \ldots ,c_{p} } \right\}\). The training set is used for machine learning, i.e., to train the classifier. Depending upon the features selected, the classifier is trained and the classification algorithm is defined. The set of unlabeled documents referred to as the test set is used to test the classifier's accuracy by comparing the classifier's output against the known label of each document in the test set.

Fig. 1

Text classification steps

In this paper, we propose a probabilistic approach for text classification. It generates a domain ontology graph (DOG) for each pre-defined class using the training dataset. This model needs no human intervention in the process of ontology learning. Here, the DOG is generated using the Markov clustering (MCL) algorithm to train the classifier. The rest of the paper comprises background, the detailed methodology used for generating the DOGs and the text classification algorithm, observations, conclusions and future scope, and limitations.

2 Background

2.1 Ontology

Ontology is basically a representation of real-world knowledge. An ontology defines a set of representative primitives for designing the model of a domain of knowledge. These representational primitives are in machine-readable format and are also understandable by human beings [4, 5]. They are composed of attributes (properties), classes, and the relationships among them. Ontology helps to develop knowledge-based systems, like Web search engines, text classification systems, and content management systems, very effectively and efficiently. Ontology helps in real-time applications also. So we can conclude that ontology can be widely used as a standard for semantic-based Web systems.

2.2 Ontology Learning

Ontology learning from textual data is a very useful method, as text data is the real source of human knowledge. Analyzing textual data requires some natural language processing approach [6, 7]. In recent years, most researchers have used artificial intelligence approaches like machine learning or statistical analysis for ontology learning. The knowledge in textual data is implicit and vague, but these techniques must compute knowledge explicitly, so there are difficulties in both quality and quantity.

2.3 Text Classification

Text classification assigns a class to each unlabeled document. It is the task of assigning a value to every \(\left( {d_{i} ,c_{j} } \right) \in D \times C\); here, D is the set of all unlabeled documents, defined as \(D = \left\{ {d_{1} ,d_{2} ,d_{3} , \ldots ,d_{n} } \right\}\), and C is the domain of all pre-defined categories, defined as \(C = \left\{ {c_{1} ,c_{2} ,c_{3} , \ldots ,c_{p} } \right\}\). The main target of TC is to approximate the value of the function \(\oslash :D\times C \to \left\{ {T,F} \right\}\). The value \(\oslash \left( {d_{i} ,c_{j} } \right) = T\) indicates that \(d_{i}\) belongs to class \(c_{j}\), and the value \(\oslash \left( {d_{i} ,c_{j} } \right) = F\) indicates that \(d_{i}\) does not belong to class \(c_{j}\) [8].

Most of the existing techniques [9,10,11,12,13] for text classification are lacking in finding the relations among the different terms of documents belonging to a particular class. Sometimes the results are biased and erroneous. So the relations among the different terms of a class are very important for classification, and thus for building the ontology. As existing systems for text classification do not consider term relations and treat every term as a unique identity, their error rate is high. The systems that have used ontologies for term relations are very complex and not very efficient. Ontology-based text classification improves on the performance of traditional systems in terms of accuracy and also reduces the problem of overfitting. In this paper, we therefore propose a probabilistic approach that generates the domain ontology for each pre-defined class from the training dataset, with no human intervention in the process of ontology learning; the DOG is generated using the MCL algorithm to train the classifier.

3 Methodology

Text classification process is divided into two phases: training phase and testing phase. In training phase, DOG is generated using feature extraction and MCL algorithm. During testing phase, text classification is done for unlabeled documents.

To model the ontology of knowledge in a domain, the ontology graph approach is used by the knowledge seeker system. An ontology graph is made up of four levels of conceptual units (CUs), linked together by different types of associations. The four CUs can be defined as:

  • Term T: the smallest conceptual unit, extracted from the text, that is relevant to the user's need.

  • Concept C: grouping related terms together with conceptual relations (CRs) builds the concepts; these are the basic units of the concept graph (CG).

  • Concept Cluster CC: a group of related concepts forms a concept cluster. It tightly binds the concepts to form a hierarchy of knowledge.

  • Ontology Graph: grouping all CCs forms the biggest and final cluster of knowledge, termed the ontology graph.
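The four levels above form a simple containment hierarchy, which can be pictured as a data model. The sketch below is only illustrative; the class names and fields are assumptions, not the paper's exact representation.

```python
from dataclasses import dataclass

# Illustrative containment hierarchy for the four conceptual-unit levels.
@dataclass(frozen=True)
class Term:
    text: str                 # smallest conceptual unit

@dataclass
class Concept:
    terms: set                # related terms grouped by conceptual relations (CRs)

@dataclass
class ConceptCluster:
    concepts: list            # related concepts bound into a knowledge hierarchy

@dataclass
class OntologyGraph:
    clusters: list            # all concept clusters of one domain (the DOG)

# A toy one-cluster, one-concept domain ontology graph
dog = OntologyGraph(clusters=[
    ConceptCluster(concepts=[Concept(terms={Term("doctor"), Term("patient")})])
])
```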

3.1 Ontology Learning

At this stage, the domain of knowledge is created in the form of the DOG. Graph creation is a knowledge-extraction process. A bottom-up approach is used for extracting the features and designing the DOG in terms of conceptual units (CUs) and conceptual relations (CRs). Bottom-up means the extraction starts from the smallest unit, i.e., the term T, and ends at the highest level, i.e., the DOG. Five learning sub-processes are defined for ontology learning [14, 15]. These are the following:

  I.

    Term Extraction: the process in which all the relevant terms are extracted from the dataset. A candidate term list \(T = \left\{ {t_{1} ,t_{2} ,t_{3} , \ldots ,t_{n} } \right\}\) is built by eliminating the irrelevant terms from the text corpus. Stop-word removal, stemming, and lemmatization are done at this step.
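As a concrete illustration of this step, the sketch below tokenizes a document, removes stop words, and applies a very rough suffix-stripping stemmer. The stop-word list and suffix rules are toy assumptions; a real pipeline would use a full stop-word lexicon and a proper stemmer or lemmatizer.

```python
import re

# Toy stop-word list (assumption; real systems use a full lexicon)
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "are"}

def strip_suffix(token):
    """Very rough stemming: drop common inflectional suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def extract_terms(text):
    """Return the candidate term list T for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

terms = extract_terms("The doctors are treating patients in the hospital")
```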

  II.

    Term-to-Class Relationship Mapping: the next step is term-to-class relationship mapping, which exploits the nonlinear relation between terms and classes; mutual information and information entropy are used for the mapping. The information entropy for each term t and class c is calculated using Eq. 1.

$$H\left( X \right) = - \sum\limits_{x \in X} {P\left( x \right)\log P\left( x \right)}$$
(1)

Then, mutual information among the term and class is calculated using Eq. 2.

$$I(t|c) = \sum\limits_{c \in C} {\sum\limits_{t \in T} {P\left( {t,c} \right)\log \frac{P\left( {t,c} \right)}{P\left( t \right)P\left( c \right)}} }$$
(2)

Then, R(t, c) relationship factor is calculated as:

$$R\left( {t,c} \right)= \frac{2I(t|c)}{H\left( t \right) + H\left( c \right)}.$$
(3)

If

\(R\left( {t,c} \right) < 1\): term t has negative dependence and is not considered in class c.

\(R\left( {t,c} \right) > 1\): term t has positive dependence and is considered in class c.

All the terms having negative dependence on a class are ignored, and the terms having positive dependence are retained for that class. Thus, term lists are prepared for all the pre-defined classes/categories in this step.
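The term-to-class filtering above can be sketched as follows. The probability estimates (document frequencies) and the Bernoulli form of the entropies are assumptions, since the paper does not fix the estimators; the corpus here is a toy.

```python
import math
from collections import Counter

# Toy labeled corpus; estimating P(t), P(c), and P(t, c) from document
# frequencies is an assumption -- the paper does not fix the estimator.
docs = [
    (["heart", "doctor"], "medical"),
    (["doctor", "drug"], "medical"),
    (["orbit", "rocket"], "space"),
    (["rocket", "heart"], "space"),
]

def probs(docs):
    n = len(docs)
    p_t, p_c, p_tc = Counter(), Counter(), Counter()
    for terms, c in docs:
        p_c[c] += 1 / n
        for t in set(terms):
            p_t[t] += 1 / n
            p_tc[(t, c)] += 1 / n
    return p_t, p_c, p_tc

def entropy(p):
    # Eq. 1 applied to a presence/absence (Bernoulli) variable -- an assumption
    return -(p * math.log(p) + (1 - p) * math.log(1 - p)) if 0 < p < 1 else 0.0

def r_factor(t, c, p_t, p_c, p_tc):
    """Eq. 3, using the single (t, c) term of the mutual information (Eq. 2)."""
    joint = p_tc.get((t, c), 0.0)
    if joint == 0.0:
        return 0.0
    i = joint * math.log(joint / (p_t[t] * p_c[c]))
    return 2 * i / (entropy(p_t[t]) + entropy(p_c[c]))

p_t, p_c, p_tc = probs(docs)
r_doc = r_factor("doctor", "medical", p_t, p_c, p_tc)    # seen only in medical
r_heart = r_factor("heart", "medical", p_t, p_c, p_tc)   # spread over both classes
```

A term concentrated in one class ("doctor") gets a higher R-factor for that class than a term spread evenly across classes ("heart"), which is exactly the dependence the filter exploits.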

  III.

    Term-to-Term Relationship Mapping: next, the interrelationship among the different terms within a class is measured by term-to-term relationship mapping. The term-to-term mapping is similarly done using the R-factor value, as given by Eq. 4.

$$R\left( {t_{a} ,t_{b} } \right) = \frac{{2I(t_{a} |t_{b} )}}{{H\left( {t_{a} } \right) + H\left( {t_{b} } \right)}}$$
(4)

Here, we obtain the relationship factor among the terms extracted at the previous step for each class. The R-factor visualization for two example classes, medical and space, is shown in Fig. 2. In this figure, only the first 11 terms are considered for each class.

Fig. 2

Term-to-term relationship mapping for medical class

  IV.

    Concept Clustering Using the MCL Algorithm: at this step, the graph generated at the previous level is clustered into tightly semantically related groups. The Markov clustering (MCL) algorithm [16], given as Algorithm 1, is used for graph clustering.

  V.

    DOG Generation: using the different concepts and relations defined at the previous levels, the DOG is generated. The node having the maximum relation value among all terms in a cluster is selected as the label for that cluster.
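Since Algorithm 1 is not reproduced here, the following is a generic sketch of van Dongen's MCL on a small graph: alternate expansion (matrix powering) and inflation (element-wise powering plus column normalization) of the column-stochastic matrix until clusters emerge. The example graph, the parameter values, and the convergence threshold are illustrative assumptions, not the paper's settings.

```python
def mcl(adjacency, expansion=2, inflation=2.0, iterations=20):
    """Minimal Markov clustering sketch on an undirected graph
    given as an adjacency matrix (list of lists of edge weights)."""
    n = len(adjacency)
    # add self-loops, then column-normalize into a stochastic matrix
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def normalize(m):
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        return m

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    m = normalize(m)
    for _ in range(iterations):
        for _ in range(expansion - 1):                       # expansion step
            m = matmul(m, m)
        m = [[v ** inflation for v in row] for row in m]     # inflation step
        m = normalize(m)

    # attractor rows with non-negligible mass define the clusters
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 0.01)
        if members:
            clusters.add(members)
    return clusters

# Two triangles {0,1,2} and {3,4,5} joined by the weak bridge 2-3
g = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
clusters = mcl(g)
```

Inflation strengthens strong flows and prunes weak ones, so the bridge edge dissolves and each triangle becomes one cluster.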

3.2 Text Classification

Text classification is achieved by finding the similarity of an unlabeled document with the DOGs. The text classification process comprises three steps: (1) generation of document ontology graphs (DocOGs) for the unlabeled document corresponding to each pre-defined class DOG, (2) deriving the score vector of the document for each class, and (3) selecting the class having the highest score as the category for the document.

3.2.1 Generation of DocOG

DocOGs are created using the DOGs generated at the previous level. An unlabeled document is input to the system, and then the DocOGs of this document corresponding to all pre-defined classes are generated. Algorithm 2 is used to generate the DocOGs.

3.2.2 Score Vector Calculation

The score of the document for each class is given by Eq. 5.

$${\text{Score}}\left( {{\text{doc}},{\text{DOG}_{j}}} \right) = {\text{score}}\left( {{\text{DocOG}},{\text{DOG}_{j}}} \right)$$
(5)

The score vector \(S = \left\{ {S_{1} ,S_{2} ,S_{3} , \ldots ,S_{n} } \right\}\) is calculated for all the n pre-defined categories. These scores are calculated by counting the nodes of each DocOG that match the corresponding DOG.

3.2.3 Category Selection

After obtaining the scores of all DocOGs, the scores for the different classes are compared and the document is assigned to the class having the highest score, i.e., the highest-scoring DocOG determines the classified domain.
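Putting steps (1)-(3) together, classification reduces to counting matched nodes and taking the arg-max. The sketch below assumes the score of Eq. 5 is simply the number of DocOG nodes that also appear in the class DOG; the DOG contents are invented toy data.

```python
# Toy class DOGs; in the real system these node sets come from ontology learning.
dogs = {
    "medical": {"doctor", "patient", "drug", "hospital"},
    "space":   {"rocket", "orbit", "nasa", "launch"},
}

def classify(doc_terms, dogs):
    """Score each class by DocOG/DOG node matching (Eq. 5); pick the arg-max."""
    scores = {}
    for label, nodes in dogs.items():
        doc_og = set(doc_terms) & nodes    # DocOG of the document for this class
        scores[label] = len(doc_og)        # matched-node count
    return max(scores, key=scores.get), scores

label, scores = classify(
    ["the", "doctor", "gave", "the", "patient", "a", "drug"], dogs)
```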

4 Software and Dataset

The proposed approach is implemented on the Linux operating system using Java, Python, and C. The 20 Newsgroups dataset was downloaded from Jason Rennie's page. This freely available dataset is a collection of approximately 20,000 newsgroup documents partitioned into 20 categories. From these 20 categories, we filtered the documents of only 12 classes for our research: Advertisement, Automobile, Computers, Cryptography, Electronics, Games, Medical, Politics, Religion, Science, Graphics, and Windows. Each file contains on average 70 words. We used different numbers of classes for comparison and for checking the efficiency of the classifier: first we took 8 categories, then 10, and then 12 categories for text classification. This variation of classes is done in order to check the effect of the number of categories on the classification power of the classifier (Fig. 3).

Fig. 3

Number of positive terms for each class

4.1 Performance Measures

Error rate is used to evaluate the classifier's efficiency and accuracy; precision, recall, and f-measure [8] are the basic performance measuring parameters. These can be defined as follows (Table 1).

Table 1 Distribution of dataset

4.1.1 Precision

Precision is also known as positive predictive value. It measures accuracy as the percentage of documents correctly retrieved out of the total retrieved documents.

$${\text{Precision}} = \frac{\text{TP}}{{\text{TP} + {\text{FP}}}}$$
(6)

Table 2 is the contingency table, in which TP is True Positive, FP is False Positive, FN is False Negative, and TN is True Negative.

Table 2 Contingency table
  • TP: number of documents correctly labeled as belonging to positive class.

  • FP: number of documents incorrectly labeled as belonging to the class.

  • FN: number of documents which are not labeled as belonging to the positive class but should have been.

  • TN: number of documents which are correctly labeled as not belonging to the class.

4.1.2 Recall

Recall is also known as sensitivity. It measures the ability of the classifier as the percentage of documents correctly classified as positive out of all documents that actually belong to the positive class.

$${\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(7)

4.1.3 F-measure

It is the harmonic mean of precision and recall. It gives the closeness between precision and recall and is defined in Eq. 8.

$$F{\text{-measure}} = \frac{{2 \times {\text{precision}} \times {\text{recall}}}}{{{\text{precision}} + {\text{recall}}}}$$
(8)
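Equations 6-8 translate directly into code. The contingency counts below are invented purely for illustration.

```python
def precision(tp, fp):
    # Eq. 6: fraction of retrieved documents that are relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # Eq. 7: fraction of relevant documents that are retrieved
    return tp / (tp + fn)

def f_measure(p, r):
    # Eq. 8: harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# assumed contingency counts, for illustration only
tp, fp, fn = 80, 20, 10
p, r = precision(tp, fp), recall(tp, fn)
f = f_measure(p, r)    # p = 0.8, r = 8/9, f = 16/19
```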

4.2 Experimental Results

The proposed algorithm for text classification is implemented and compared with the Naive Bayes and k-Nearest Neighbor classifiers. The Naive Bayes and k-Nearest Neighbor classifiers are implemented in Python using the scikit-learn ("sklearn") library. In the k-Nearest Neighbor algorithm, ten nearest neighbors are considered for measuring the distances in classification.
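The baselines can be reproduced with a few lines of scikit-learn. The tiny corpus below stands in for the real 20 Newsgroups data, and the feature representation (raw term counts via `CountVectorizer`) is an assumption about the baseline setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Tiny illustrative corpus; the real experiments used 20 Newsgroups.
train_docs = ["doctor treats patient", "drug cures disease",
              "rocket reaches orbit", "nasa launches rocket"] * 3
train_labels = ["medical", "medical", "space", "space"] * 3

vec = CountVectorizer()
X = vec.fit_transform(train_docs)          # raw term-count features

nb = MultinomialNB().fit(X, train_labels)
# ten nearest neighbors, as in the paper; this toy corpus has 12 documents
knn = KNeighborsClassifier(n_neighbors=10).fit(X, train_labels)

x_new = vec.transform(["patient takes drug"])
```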

The three classifiers are implemented using 8, 10, and 12 classes or categories to measure the performance of classifiers effectively. It is done to evaluate and compare the effect of number of categories on the classification power of the classifier. To evaluate the power of classifiers, the comparison is done using precision, recall, and f-measure.

4.2.1 Experiment 1

In Experiment 1, we have considered the number of categories N = 8 for text classification. These are Automobile, Electronics, Religious, Sports, Medical, Cryptography, Science, and Politics. Then, the performance is evaluated using precision, recall, and f-measure. Table 3 shows the values of precision, recall, and f-measure for different models using number of classes N = 8. Figure 4 gives the representation for comparison of f-measure for different classifiers for different classes.

Table 3 Precision, recall, and f-measure for N = 8
Fig. 4

Representation for comparison of f-measure for N = 8

The average f-measure for the proposed DOG model is 91%, while that of Naive Bayes is 84% and that of k-NN is 77%. These f-measure values show that the proposed DOG model performs better than the other two classifiers; k-NN has the lowest accuracy and classification power. The proposed model shows a maximum of 97% accuracy for the class Electronics and a minimum of 81% for the class Science. Figure 4 shows the graphical comparison of the three techniques: the proposed model, Naive Bayes, and the k-NN algorithm for text classification.

4.2.2 Experiment 2

In Experiment 2, we have considered the number of categories N = 10 for text classification. These are Automobile, Electronics, Religious, Sports, Medical, Cryptography, Science, Politics, Advertisement, and Computer. Then, the performance is evaluated using precision, recall, and f-measure. Table 4 shows the values of precision, recall, and f-measure for the different models using number of classes N = 10. Figure 5 compares the f-measure of the different classifiers across the classes. The average f-measure for the DOG model is 88%, while that of Naive Bayes is 82% and that of k-NN is 73%.

Table 4 Precision, recall, and f-measure for N = 10
Fig. 5

Representation showing comparison of f-measure for N = 10

These f-measure values show that the DOG model performs better than the other two classifiers; k-NN has the lowest accuracy and classification power. The proposed model shows a maximum of 95% accuracy for the class Sports and a minimum of 79% for the class Cryptography. Figure 5 shows the graphical comparison of the three techniques: the proposed model, Naive Bayes, and the k-NN algorithm for text classification.

4.2.3 Experiment 3

In Experiment 3, we have considered the number of categories N = 12 for text classification. These are Automobile, Electronics, Religious, Sports, Medical, Cryptography, Science, Politics, Advertisement, Computer, Graphics, and Windows. Then, the performance is evaluated using precision, recall, and f-measure. Table 5 shows the values of precision, recall, and f-measure for the different models using number of classes N = 12. Figure 6 compares the f-measure of the different classifiers across the classes.

Table 5 Precision, recall, and f-measure for N = 12
Fig. 6

Representation showing comparison of f-measure for N = 12

The average f-measure for the DOG model is 85%, while that of Naive Bayes is 79% and that of k-NN is 69%. These f-measure values show that the proposed DOG model performs better than the other two classifiers; k-NN has the lowest accuracy and classification power. The proposed model shows a maximum of 92% accuracy for the classes Medical and Cryptography and a minimum of 76% for the class Politics. Figure 6 shows the graphical comparison of the three techniques: the proposed model, Naive Bayes, and the k-NN algorithm for text classification.

4.2.4 Experiment 4

Experiment 4 shows the average value of precision, recall, and f-measure for all the three classifiers (Table 6).

Table 6 Average value of f-measure for all classifiers

Figures 7 and 8 show the graphical representation of the average f-measure for all three classifiers. They show that the accuracy of the classifiers decreases as the number of categories increases: the f-measure value decreases for every classifier with increasing N. Comparing the average values also confirms that the proposed classifier is more accurate than the traditional Naive Bayes and k-NN approaches for text classification.

Fig. 7

Comparison of average f-measure value

Fig. 8

Line-graph representations for f-measure comparison

Figure 8 shows, using a line-graph representation, how the accuracy of the different classifiers decreases as the number of categories increases. From this figure, it is also concluded that the proposed classifier is the most accurate of the three.

5 Conclusion and Future Scope

The proposed scheme for classifying English texts using a probabilistic approach is a fully automatic system; we just give the dataset and pre-defined classes as input. A DOG with hierarchical clustering was previously used for Chinese text; this is the first time the DOG model with MCL clustering is used for English text. The DOG model increases classification power, and effective feature extraction is done by considering the nonlinear relations among the terms. Here, the domain ontology graph model is designed to generate the knowledge representation, and the MCL clustering algorithm is used to cluster the terms of the graph. The use of the MCL algorithm makes the system efficient; as it is a mathematical approach, it is more accurate. This approach is scalable to huge datasets, and its classification power is not affected when the relations among terms are numerous. But there are limitations. We have not used synonyms and antonyms while designing the ontology. Also, MCL clustering performs matrix multiplication, which is \(O(n^{3})\), so the system is computationally expensive.

This work has devised a text classification system: a probabilistic approach that uses the DOG model with the MCL clustering algorithm for English text classification. Experimental results show that it is an accurate and effective text classifier. This DOG model is used for the first time for text classification. But there is still much to do; in the near future, this work can be extended as follows:

  • Synonyms and antonyms can also be added while generating the domain ontology graph.

  • Semantic-based learning approach can also be used in future for improving the system.

  • It can also be applied to other languages.