
1 Introduction

Text categorisation, or text classification, is a sub-branch of text mining and natural language processing which attempts to assign documents to specific categories. Text categorisation is important in many applications such as the automated generation of metadata in document retrieval systems, question answering systems and search engines (Sahin 2007). Text documents are usually represented using the bag-of-words approach and the vector space model. In this representation, a document or a document set can have thousands of vectors, where each vector is usually a tuple consisting of a word and its term frequency. Because of the very high dimensionality inherent in textual data, feature selection is one of the key steps in text categorisation.

Given a set of n words, it is possible to generate \( 2^{n} - 1 \) non-empty subsets of 1 to n words and \( \frac{n!}{{r!\left( {n - r} \right)!}} \) possible subsets of size r. Thus, there are more than 47 trillion ways of choosing 30 words from a bag of 50 words. So, for a given document set, it is usually not possible to consider all possible combinations of terms in order to find the best one. Therefore, when an exact solution to a problem is not feasible, it is necessary to resort to approximations. In text mining, dimension reduction is usually achieved by converting all text to lowercase, removing stopwords, stemming and lemmatisation. However, this is often not sufficient and the dimension often remains unusually large.
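As a quick sanity check on these figures, the following short Python snippet (purely illustrative) computes both quantities directly:

```python
from math import factorial

def n_choose_r(n, r):
    """Number of distinct subsets of size r drawn from n items."""
    return factorial(n) // (factorial(r) * factorial(n - r))

n = 50
print(2 ** n - 1)          # all non-empty subsets of 50 words: 1,125,899,906,842,623
print(n_choose_r(n, 30))   # subsets of exactly 30 words: 47,129,212,243,960 (~47 trillion)
```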

In this paper, we investigate three different methods which are all based on the principles of Markov chain Monte Carlo (MCMC) methods. These Monte Carlo techniques, due to their general applicability, have been used in a large variety of domains, especially for studies involving simulations in physics, biology and computer science (Browne et al. 2012). Andrieu et al. (2003) wrote an interesting paper which bridges the gap between Monte Carlo methods and traditional machine learning techniques. The basic Monte Carlo techniques, their various variance-reduction variants and their applications are described in detail, and a theoretical foundation for the different techniques is also provided.

The use of simulated annealing and genetic algorithms as feature selection methods in the text mining field is scarce compared to other fields, but not new. Genetic algorithms have been used in information retrieval systems by Gordon (1988) and Chen and Kim (1994). Although simulated annealing (Metropolis et al. 1953) is an older technique than genetic algorithms (Holland 1975), its use as a feature selection method in text classification is relatively recent (Wang et al. 2006; Yang et al. 2007).

In this work, inspired by the principles of Monte Carlo, we show that by repeatedly determining the category of a document from a sample of features selected through a random walk or differential evolution, we are able to deduce the category of a court judgement with higher accuracy using only a very small fraction of the total number of features. The use of the Monte Carlo approach allows a micro-local analysis to be performed on a high-dimensional problem. We believe that this new technique can become a tool of general applicability in the field of text classification.

The performance of machine learning classifiers hinges on the availability of large amounts of training data. Manual classification of court cases requires highly skilled professionals and is a time-consuming and costly undertaking. Through passive selection, we divided our dataset into several partitions of training set (10–90%) and testing set (90–10%). In the future, we intend to use the principles of active learning to optimise the choice of the training samples (Roy and McCallum 2001; Figueroa et al. 2012). Although support vector machines perform better on a 90–10 split, our Monte Carlo-based classification achieves higher accuracies when the training set is less than 80% of the full dataset.

The remainder of this paper is organised as follows. Section 2 discusses prior research on Markov chain Monte Carlo methods, simulated annealing and genetic algorithms. Section 3 describes in detail how these methods have been adapted for feature selection and text classification in our system. The experimental results and discussions are provided in Sect. 4. Finally, the conclusion is offered in Sect. 5, wherein some potential avenues for future research are indicated.

2 Related Works

For a modern and comprehensive literature review of the field of text document classification, the reader is referred to Khan et al. (2010). The paper starts by describing state-of-the-art applications of text mining and document classification. The pre-processing steps such as feature extraction, feature selection, dimensionality reduction and document representation are also explained. Machine learning techniques such as kNN, naïve Bayes, decision trees, support vector machines, artificial neural networks, fuzzy logic, genetic algorithms, classification rules and hybrid techniques are also covered in reasonable depth.

The Monte Carlo method finds its roots in the field of statistics, in which random numbers are used to perform simulation (Liang and Wong 2000; Goncharov et al. 2007; Houghton et al. 2014). This method is often used in problems involving very high dimensions, for example, in the Travelling Salesman Problem, where there is a very large number of potential solutions. The problem rapidly becomes intractable because of the exponential rise in the number of paths for each new node that is added to the existing network (Buxey 1979; Martin et al. 1991).

Monte Carlo simulation approaches have found wide applications in computer games (Browne et al. 2012). An algorithm known as the Monte-Carlo Tree Search Solver (MCTS-Solver) has been successfully applied to the game of Lines of Action in order to find better strategies to play the game (Winands et al. 2008). It was shown that a program using the MCTS-Solver was able to win 65% of its matches.

An interesting application of the Monte Carlo search framework is described in Branavan et al. (2012). In particular, they used a Monte Carlo approach to feed textual information retrieved from the game manual to a game-playing agent. This agent was able to outperform the uninformed and untrained agent (the default AI) in the game of Civilization by 34%.

Another modern area where Monte Carlo simulations are being heavily used is bioinformatics. Ebbert et al. (2011) have used the Monte Carlo approach to generate samples for characterising uncertainty in multivariate assays. This uncertainty information is very important for clinicians in order to properly classify different types of tumours. Diaconis (2009) has used the principles of Markov chain Monte Carlo to decrypt messages. More recently, Monte Carlo methods have been applied to multi-label classification (Read et al. 2014).

The concept of an evolutionary Monte Carlo method is not new. It has previously been used in scientific, engineering and mathematical fields for the estimation of various parameter values (Smith and Hussain 2012) and uncertainties in optimisation problems (Ter Braak 2006; Wu et al. 2006; Xiao 2007).

One of the earliest works which applied a genetic algorithm (GA) for the extraction of correlated terms was done by Desjardins et al. (2005). However, they concluded that the co-occurrences found by the GA did not improve the retrieval accuracy and more work was required to understand the cognitive factors which would improve the relevancy of retrieved results.

Gavrilis et al. (2005) used a GA for feature selection on 650 PUBMED abstracts spread across 5 categories. Using an SVM classifier, they achieved 85% accuracy with only 20 features. In another similar study, on spam emails, they report an accuracy of 97%, again using only 20 features (Gavrilis et al. 2006).

A comparison was made between tf-idf (term frequency-inverse document frequency) and genetic algorithms by Khalessizadeh et al. (2006) for the identification of document topics in relatively short Persian texts. While precision values were very similar for both methods, GA did better than tf-idf on recall for all document sizes.

Pietramala et al. (2008) proposed the Olex-GA algorithm which assigns positive and negative rules to each feature (gene) in a chromosome. Their approach performed better on the OHSUMED dataset when compared with techniques such as naïve Bayes, C4.5, Ripper and Support Vector Machines. However, for the Reuters-21578 dataset, SVM did significantly better than Olex-GA.

Song and Park (2009) used a genetic algorithm to find the optimal number of clusters in the Reuters-21578 dataset. The number of terms was reduced using latent semantic indexing (LSI) before the GA was applied. They were able to show that their algorithm performs better than previous methods. However, their study was based on only 1000 documents from 5 categories. Liu and Fu (2012) used an elitist GA to classify 100 web pages into 5 categories and obtained better performance than SVM with default parameters.

Chen et al. (2013) proposed a novel text classification procedure based on chaos optimisation theory and genetic algorithms. They tested their approach on the Reuters-21578 dataset and were able to demonstrate that the algorithm requires only a small feature set to deliver recall comparable with earlier, higher-dimensional techniques. Using GATE (Cunningham and Tablan 2002) and Weka (Hall et al. 2009), Rogers (2013) implemented a single pipeline for feature selection using genetic algorithms and classification using several machine learning classifiers. Pavlyshenko (2014) used a genetic algorithm to determine the optimal subset of keywords which could be used to identify the authors of English fiction texts.

A good introduction to feature selection in text mining using simulated annealing can be found in Bagheri et al. (2014). They demonstrated that their proposed approach delivered performance similar to chi-squared when tested on a Persian dataset consisting of 7 categories, with about 800 documents in each category. Zhu et al. (2015) further showed that an improved simulated annealing algorithm (SAA) can select better features than information gain (IG), mutual information (MI) and chi-squared (CHI).

Moshki et al. (2015) tested an extended version of the simulated annealing algorithm (SAGRASP) on a diverse set of data and showed that it was able to do better than FCGRASP (Bermejo et al. 2011). Zhu et al. (2009) successfully combined a genetic algorithm and simulated annealing (SA) to extract protein sequences using OpenMP. They reported that their proposed solution did not get trapped in local maxima and could find global optima faster. Two decades earlier, Esbensen and Mazumder (1994) used a mixture of SA and GA, which they called SAGA, to determine an optimal placement for macro-cells.

3 Description of Algorithms

3.1 Markov Chain Monte Carlo Random Walk (MCMCRW)

A random walk is a random or stochastic process consisting of a series of random steps (Pemantle 2007; Samad 2013). The principles of the random walk have found wide applications in different fields of computer science such as the analysis of computer networks (Zhong et al. 2008), computer security (Zhou 2016), bioinformatics (Draminski et al. 2008) and text classification (Hassan et al. 2007).

In our system, a random walk consists of a sequence of similar operations in which a subset of k elements is randomly selected (with replacement) from a list of n elements from each of the m categories, repeated p times. In computer science, this is known as a Markov chain Monte Carlo (MCMC) process. Each element is a word, and these m * n elements (the mainlist) are initially selected using term frequency. We have two sets of data: the training set and the testing set. The training set is used only for the extraction of representative elements for each category. We do not use a wrapper-style classifier (Jovic et al. 2015) to measure the classifier accuracy; instead, we use a naïve classifier based on term frequency. Each of the m sublists is compared to every document in the testing set and the sum of the term frequencies of its k words is computed. The sublist with the highest score gives the predicted category. This classifier can be considered a simplified version of the k-nearest neighbour classifier. A simple majority vote is then carried out on the results obtained after the p iterations.
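A minimal sketch of this procedure is given below. It assumes the mainlist is held as a dictionary mapping each category to its n most frequent training words and that each test document is represented as a term-frequency Counter; the names mcmc_random_walk, mainlist and test_docs are ours and not taken from the original implementation.

```python
import random
from collections import Counter

def mcmc_random_walk(mainlist, test_docs, k=30, p=100, seed=None):
    """Sketch of MCMCRW: mainlist maps each category to its n most frequent
    training words; each test document is a Counter of term frequencies.
    Returns one predicted category per document (majority vote over p walks)."""
    rng = random.Random(seed)
    votes = [Counter() for _ in test_docs]

    for _ in range(p):                                    # p steps of the random walk
        # sample k words (with replacement) from each category's sublist
        samples = {cat: [rng.choice(words) for _ in range(k)]
                   for cat, words in mainlist.items()}
        for i, doc_tf in enumerate(test_docs):
            # naive classifier: score = sum of the document's frequencies of the sampled words
            scores = {cat: sum(doc_tf[w] for w in sample)
                      for cat, sample in samples.items()}
            votes[i][max(scores, key=scores.get)] += 1    # vote for the top-scoring category

    return [vote.most_common(1)[0][0] for vote in votes]
```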

3.2 Boosted Simulated Annealing (BSA)

The basic principles of the simulated annealing algorithm were described by Metropolis et al. (1953). Their aim was to simulate the movement of atoms at a finite temperature. In 1970, Hastings showed how this method could be generalised to solve problems in statistics. Kirkpatrick et al. (1983) took up the same basic ideas but added the concepts of high and low temperatures and compared the algorithm to the annealing process in metals, from which the algorithm gets its name. It is only very recently that the simulated annealing algorithm has been used for feature selection in the text classification field (Wang et al. 2006; Yang et al. 2007; Bagheri et al. 2014; Zhu et al. 2015; Moshki et al. 2015).

In our system, we implemented the standard simulated annealing algorithm but boosted it with a good initial sample, produced by random sampling through a random walk of 100 steps. The selection of elements is similar to MCMCRW. However, in BSA, the next step is not independent of the current one. The next list is generated by replacing t elements in the current best list with t other elements from each category (sublist) of the mainlist. The number t is a temperature variable which decreases steadily towards its minimum from the first iteration until the last one. If the accuracy increases after this small change, the best list is updated. Even if the accuracy decreases by less than x percent, we still accept this new list as the current one. However, if the accuracy decreases by x percent or more, the current list is discarded and a new one is generated. This entire process is repeated q times. Our algorithm is also stateful in that it has a memory to store the best list from any of the q iterations.
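A sketch of BSA under the same assumptions as the MCMCRW sketch above is shown below: pool maps each category to its n most frequent training words, and score_fn is a scoring callback returning classification accuracy for a candidate selection. Both names, and the exact acceptance rule, are our reading of the description rather than the authors' code.

```python
import random

def boosted_simulated_annealing(pool, score_fn, k=30, t0=10, t_min=1,
                                x=0.05, q=100, seed=None):
    """Sketch of BSA: pool maps each category to its n most frequent words;
    score_fn returns classification accuracy for a {category: [k words]} selection."""
    rng = random.Random(seed)

    def random_selection():
        return {cat: rng.sample(words, k) for cat, words in pool.items()}

    # "boost": start from the best selection found by a 100-step random walk
    best = max((random_selection() for _ in range(100)), key=score_fn)
    best_score = score_fn(best)
    current, current_score = best, best_score

    t = float(t0)
    for _ in range(q):
        # perturb: replace t words per category with unused words from the pool
        n_swap = max(1, int(round(t)))
        candidate = {}
        for cat, words in current.items():
            unused = [w for w in pool[cat] if w not in words]
            new_words = list(words)
            for idx, w in zip(rng.sample(range(k), n_swap), rng.sample(unused, n_swap)):
                new_words[idx] = w
            candidate[cat] = new_words
        cand_score = score_fn(candidate)

        if cand_score >= best_score:                  # improvement: new best
            best, best_score = candidate, cand_score
            current, current_score = candidate, cand_score
        elif cand_score > current_score * (1 - x):    # small drop (< x): still accept
            current, current_score = candidate, cand_score
        else:                                         # drop of x or more: restart
            current = random_selection()
            current_score = score_fn(current)

        t = max(t_min, t - 0.1)                       # cool the temperature

    return best, best_score
```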

3.3 Genetic Algorithms

A genetic algorithm (GA) is often described as a meta-heuristic algorithm which emulates Darwin's theory of natural evolution through the biological processes of selection, mating and mutation (Mitchell 1998). Since their formulation by Holland in the 1970s, GAs have found wide applications, not only in scientific areas but also in business. Thomas and Sycara (2002) used GAs for predicting stock prices, while Borg (2009) used them for the automatic extraction of definitions. In combination with other methods, Waad et al. (2014) used a genetic algorithm to select the best features to assess the creditworthiness of a potential client. A recent and novel application of GAs is the detection of errors in SQL instructions (Moncao et al. 2013). Also, as stated earlier, GAs have been used in the information retrieval domain for more than two decades (Atkinson-Abutridy et al. 2004; Al-Maqaleh et al. 2012).

In our system, we maintain a population of r mainlists. In GA jargon, a mainlist can be considered a chromosome. A mainlist is a list of sublists and each sublist contains the most frequent terms in one category of documents. As mentioned earlier, each term is a word (gene). A fitness score (classification accuracy) is calculated for each of these r lists. The best b lists (chromosomes) are selected in each iteration for crossover and mutation. These b lists are randomly paired with each other (without duplication) to generate \( \frac{b}{2} \) pairs, and c elements are then chosen randomly from one member of each pair and swapped with its partner. This is the crossover operation, whereby b new lists are created and the previous b lists are discarded. Our crossover operation is slightly different from previous approaches that use fixed locations for genes in that we do not exchange a segment of the chromosome; instead, we exchange genes selected randomly from anywhere in the chromosome. The rationale for this heuristic approach is that the genes (words) are independent, and this reduces the likelihood of collisions. The next operation is mutation, which is achieved by the substitution of d elements in each list with new elements from the mainlists. The remaining r − b lists are rejected and new ones are generated randomly in order to keep the size of the population constant. These processes are repeated q times. The best list from each of these q iterations is stored in a separate memory.
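The sketch below follows the same conventions as the earlier ones (a per-category word pool and a score_fn fitness callback, both names ours). It also assumes that crossover and mutation operate per category and that b is even; the original description does not fix these details, so they are assumptions of this sketch rather than the authors' implementation.

```python
import random

def genetic_feature_selection(pool, score_fn, k=30, r=20, b=10, c=6, d=6,
                              q=100, seed=None):
    """Sketch of the GA-based selector: pool maps each category to its n most
    frequent words; score_fn is the fitness (classification accuracy) of a
    {category: [k words]} selection. b is assumed to be even."""
    rng = random.Random(seed)

    def random_list():
        return {cat: rng.sample(words, k) for cat, words in pool.items()}

    population = [random_list() for _ in range(r)]
    best, best_score = None, -1.0

    for _ in range(q):
        ranked = sorted(population, key=score_fn, reverse=True)
        top_score = score_fn(ranked[0])
        if top_score > best_score:                    # separate memory for the best list
            best, best_score = ranked[0], top_score

        parents = ranked[:b]
        rng.shuffle(parents)                          # random pairing, no duplication
        children = []
        for p1, p2 in zip(parents[0::2], parents[1::2]):
            c1 = {cat: list(w) for cat, w in p1.items()}
            c2 = {cat: list(w) for cat, w in p2.items()}
            for cat in pool:                          # crossover: swap c genes chosen
                for idx in rng.sample(range(k), c):   # from random positions
                    c1[cat][idx], c2[cat][idx] = c2[cat][idx], c1[cat][idx]
            children.extend([c1, c2])

        for child in children:                        # mutation: replace d genes per category
            for cat in pool:
                unused = [w for w in pool[cat] if w not in child[cat]]
                for idx, w in zip(rng.sample(range(k), d), rng.sample(unused, d)):
                    child[cat][idx] = w

        # the b children replace their parents; the remaining r - b slots are re-seeded
        population = children + [random_list() for _ in range(r - b)]

    return best, best_score
```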

4 Experiments and Settings

4.1 Experimental Corpus

Our dataset consists of 294 judgements which were delivered in the Supreme Court of the Republic of Mauritius in the year 2013. The cases have previously been classified manually into eight categories: homicide, road traffic offences, drugs, other criminal offences, company law, labour law, land law and contract law. It is the same dataset that we used in an earlier work (Pudaruth et al. 2016).

4.2 Document Pre-processing

Because textual data is inherently noisy, it is important to filter out those elements that would negatively impact the performance of classifiers. Thus, all the data was first converted to lowercase, after which all digits, symbols, short words and stopwords were filtered out. Besides the common English stopwords, the list also included words from the legal domain such as case, act, section, law, court, appellant, respondent, judge and many others. These words tend to occur in almost all judgements and their frequencies are also very high. This is an interesting issue because the same words could have been strong differentiators if our objective had been to look for legal documents in a mixture of documents from other domains. With the help of WordNet (2017), all non-English words and all verbs were removed from the dataset. Taken together, the pre-processing steps reduced the feature set from 6048 to 2485 words.
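A rough sketch of this pipeline is given below. It assumes NLTK's WordNet and stopword corpora (the paper cites WordNet but does not name a library), shows only a small extract of the legal stopword list, and drops words whose WordNet entries are exclusively verbal; that treatment of verbs is our own reading of the description, not necessarily the authors' exact rule.

```python
import re
from nltk.corpus import stopwords, wordnet as wn   # NLTK corpora must be downloaded first

LEGAL_STOPWORDS = {"case", "act", "section", "law", "court",
                   "appellant", "respondent", "judge"}      # extract of the domain list
STOPWORDS = set(stopwords.words("english")) | LEGAL_STOPWORDS

def preprocess(text, min_len=3):
    """Rough sketch of the pre-processing pipeline described above."""
    tokens = re.findall(r"[a-z]+", text.lower())             # lowercase; drops digits and symbols
    kept = []
    for tok in tokens:
        if len(tok) < min_len or tok in STOPWORDS:           # short words and stopwords
            continue
        senses = wn.synsets(tok)
        if not senses:                                       # not in WordNet: treat as non-English
            continue
        if all(s.pos() == "v" for s in senses):              # exclusively verbal entries are dropped
            continue
        kept.append(tok)
    return kept
```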

4.3 Experimental Environment and Algorithmic Parameters

The documents are kept in a parent folder with eight directories. Each directory contains the files for one specific category, with each judgement stored as a separate text file. The software for document pre-processing and feature selection has been implemented in Python 2.7.12 (Spyder 3.0.0) from the Anaconda distribution. The scikit-learn library for Python has been used for the machine learning part. A computer running the 64-bit Windows 7 Professional N (SP1) operating system was used in this study, with an Intel(R) Core(TM) i5-4200M CPU running at 2.5 GHz, 8 GB of RAM and a 650 GB hard disk.

The number of iterations for each of the 3 Monte Carlo methods was 100. A sample consisted of 30 (k) elements drawn from a larger set of the 100 (n) most frequent words from each of the 8 (m) categories. The mutation rate for both the simulated annealing and genetic algorithms was 20%; this means that for every 30 elements, 6 elements were exchanged. The initial temperature for SA was set at 10 (for every sublist) and was reduced by 0.1 after every iteration until it reached a minimum value of 1. The SA algorithm was allowed to accept solutions which were up to 5% (x) worse than the current one in an attempt to avoid local maxima. The crossover rate for the GA was also set at 20%. The parameters were chosen empirically, after conducting a large set of experiments and observing their impact on the accuracy. All the algorithms were run 5 times on each split percentage and the results were averaged.
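For reference, the parameter values just described can be gathered into a single configuration; the names mirror those used in Sect. 3 and in the sketches above, and the grouping itself is ours rather than part of the original implementation.

```python
# Parameter values reported in this section, gathered in one place.
PARAMS = {
    "iterations": 100,      # p / q: steps per run for all three methods
    "k": 30,                # words sampled per category
    "n": 100,               # most frequent words kept per category
    "m": 8,                 # number of categories
    "mutation_rate": 0.20,  # BSA and GA: 6 of the 30 elements exchanged
    "crossover_rate": 0.20, # GA
    "t0": 10,               # BSA: initial temperature per sublist
    "t_min": 1,             # BSA: final temperature
    "t_step": 0.1,          # BSA: cooling step per iteration
    "x": 0.05,              # BSA: largest acceptable relative drop in accuracy
    "runs_per_split": 5,    # results averaged over 5 runs per split
}
```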

4.4 Experimental Results

In this sub-section, we present detailed results to demonstrate the effectiveness of the Markov chain Monte Carlo random walk (MCMCRW) algorithm compared to boosted simulated annealing (BSA), genetic algorithms (GA) and support vector machines (SVM). The dataset has been split into nine different sets of training and testing data, as shown in Table 1, with a view to understanding the influence of different training sizes on feature selection and classification accuracy. A comparison with SVM is also provided.

Table 1. Details of cases dataset

In general, classification accuracy increases as the size of the training set increases, as shown in Table 2. For all training sizes, MCMCRW and SVM are more effective than BSA and GA. The best accuracy of 83% is obtained by MCMCRW with 70% of the data used for training and by SVM with 90%. When the training size is 20% or less, all three of our algorithms (MCMCRW, BSA and GA) do better than SVM (with default settings and parameters). When the training size is between 40 and 70%, only MCMCRW is able to outperform SVM. The results illustrate that our random walk algorithm can produce satisfactory results even with a small amount of training data. In the legal field, where producing annotated data is very costly and time-consuming as it has to be done by highly trained professionals, the random walk algorithm requires only a few relevant training documents per category to deliver acceptable results.

Table 2. Classification accuracy v/s training/testing size
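The SVM baseline with default settings referred to above can be reproduced along the following lines with scikit-learn. The vectoriser and the use of a stratified split are assumptions of this sketch; the original experiments do not specify them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def svm_baseline(documents, labels, train_size=0.7, seed=0):
    """Bag-of-words + SVC with default parameters, evaluated on a single split."""
    X_train, X_test, y_train, y_test = train_test_split(
        documents, labels, train_size=train_size, stratify=labels, random_state=seed)
    vec = CountVectorizer()
    clf = SVC()                                        # default settings and parameters
    clf.fit(vec.fit_transform(X_train), y_train)
    predictions = clf.predict(vec.transform(X_test))
    return accuracy_score(y_test, predictions)
```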

Table 3 shows the detailed results for one run on a 70/30 split, in which there were 201 cases in the training set and 93 cases in the testing set. The overall accuracy for this run was 84%. Accuracy is defined as the number of correctly classified documents over the total number of documents in the testing set. The Road traffic offences category has perfect recall, which means that we have been able to retrieve all instances of this category from the testing set, while the Contract category has the lowest recall because 3 of its cases were incorrectly classified as Labour cases and another two as Land cases. However, these misclassifications are quite understandable, as these three categories share many terms in common since they all deal with contractual issues.

Table 3. Confusion matrix

The Homicide category has the highest precision, followed closely by Road traffic offences and Contract. The Labour and Company categories have the lowest precision values. Nine documents were returned as belonging to the Company category; however, only six of them are actually company law cases. One belongs to the labour law category while another two belong to the land law category. Again, we see that there is some overlap in features between the Contract, Company and Labour categories. Some additional work will be necessary in order to reduce this mix-up.
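The per-category precision and recall figures discussed in the last two paragraphs follow directly from the confusion matrix in Table 3. The small helper below shows the computation, assuming rows hold the true categories and columns the predicted ones (the orientation of Table 3 itself is not reproduced here).

```python
import numpy as np

def per_class_metrics(cm, labels):
    """Accuracy, per-class precision and recall from a confusion matrix
    (rows = true categories, columns = predicted categories)."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)      # of all true members of a class, how many were retrieved
    precision = np.diag(cm) / cm.sum(axis=0)   # of all documents returned for a class, how many are correct
    return accuracy, dict(zip(labels, precision)), dict(zip(labels, recall))
```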

The Olex-GA system proposed by Pietramala et al. (2008) did slightly better than SVM on the OHSUMED dataset but less well on the Reuters-21578 dataset. The SA algorithm proposed by Yang et al. (2007) performed slightly better than kNN under some settings, and Wang et al. (2006) reported a similar result. However, it is always very difficult to offer a fair comparison between GAs and SAs on the one hand and machine learning classifiers on the other. Variations in the total number of documents in the dataset, the number of classes, the size of the documents, the number of training and testing samples, the nature and complexity of the documents, and the multitude of parameters used in the evolutionary algorithms and in the classifiers all make comparisons between the various studies very difficult.

5 Conclusions

This paper presents three different feature selection methods and a general text categorisation framework. After an extensive empirical evaluation on a set of 294 Supreme Court judgements spread across eight areas of law, we found that the simulated annealing and genetic algorithms, with an average classification accuracy of about 67%, did less well than the conceptually simpler random walk, while the latter's performance was on average better than that of support vector machines. The idea of using evolutionary operations for the selection of suitable features is not new; however, full-fledged implementations of these techniques in the domain of text classification are relatively recent. The random walk algorithm is very robust, as it does not fluctuate as much as machine learning classifiers do when the number of training samples is reduced. An added benefit of this system is its simplicity and understandability: it is very easy for a user to improve the quality of the process by adding new training samples or new filters. In future work, we intend to choose the training instances using an active learning technique instead of passive learning. Combining the strengths of each of these feature selection methods into a single algorithm is also a potential avenue for further research.