
1 Introduction

Natural language call routing is an important problem in the design of modern automatic call services, and solving it can substantially improve the quality of such services. In general, natural language call routing can be considered as two different problems: the first is speech recognition of the calls, and the second is call categorization for further routing. This paper focuses on text categorization methods applied to call routing.

Text classification can be considered a part of natural language understanding, where there is a set of predefined categories and the task is to automatically assign new documents to one of these categories. The method of text preprocessing and text representation influences the results obtained even with the same classification algorithms.

In the vector space model [18] text categorization is considered as a machine learning problem. The complexity of text categorization with a vector space model is compounded by the need to extract numerical data from the text before applying machine learning methods. Therefore, text categorization consists of two parts: text preprocessing and classification using the obtained numerical data.

All text preprocessing methods are based on the idea that the category of a document depends on the words or phrases it contains. The simplest approach is to take each word of the document as a binary coordinate, so that the dimensionality of the feature space equals the number of words in the dictionary (the “bag of words” method [10]). There exist more advanced preprocessing approaches, such as term weighting methods, that address the limitations of this simple representation. There exist both unsupervised and supervised term weighting methods. The most well-known unsupervised term weighting method is TF-IDF [17]. The following supervised term weighting methods are also considered in this paper: Gain Ratio [4], Confident Weights [21], TM2 [23], Relevance Frequency [13], Term Relevance Ratio [11] and Novel Term Weighting [5]; these methods involve information about the classes of the utterances. It is important to note that we use no morphological or stop-word filtering before text preprocessing. This means that the text preprocessing can be performed without expert or linguistic knowledge and is language-independent.

After text preprocessing we obtain a vector of numerical variables for each document, and the dimensionality of the feature space is the number of words in the dictionary. In this case, direct application of machine learning algorithms is time-consuming. Many researchers have used a variety of unsupervised techniques in order to improve classification quality or to decrease the dimensionality of the feature space. One common approach is to induce word features with unsupervised methods (for example, clustering, which was used in [8, 12, 14, 16], or neural language models, which have been proposed in [1, 3, 15, 20]) and then to apply a supervised algorithm. In our work we propose a novel feature extraction method, based on the assignment of terms to classes, as a dimensionality reduction method. The method reduces the dimensionality radically: the resulting dimensionality equals the number of classes. After dimensionality reduction we apply the k-nearest neighbours algorithm (k-NN) for classification.

In this paper we propose collectives of different term weighting methods, based on the idea that different term weighting methods produce classification errors on different utterances; therefore, the simultaneous use of several term weighting methods may increase the effectiveness of text classification. We propose meta-classification with rule induction as the collective classification approach.

This paper is organized as follows: In Sect. 2, we describe the problem and the database. Section 3 describes the considered term weighting methods. The feature extraction method and the classification algorithms are presented in Sect. 4. Section 5 reports on the experimental results. Finally, we provide concluding remarks and directions for future investigations in Sect. 6.

Table 1 The distribution of the classes

2 Corpora Description

The data for testing and evaluation consists of 292,156 user utterances recorded in English from caller interactions with commercial automated agents. The database contains calls in textual format after speech recognition and was provided by the company Speech Cycle (New York, USA). Utterances from this database were manually labelled by experts and divided into 20 classes (such as appointments, operator, bill, internet, phone and technical support). One of them is a special class, TE-NOMATCH, which includes utterances that cannot be put into any other class or can be put into more than one class. The classes and their distribution are presented in Table 1.

The database contains 45 unclassified calls, which were removed. It also contains 23,561 empty calls without any words; these calls were automatically placed in the class TE-NOMATCH and were also removed from the database. As a rule, the calls in the database are short; many of them contain only one or two words. Consequently, there are many duplicated utterances, and the duplicates were removed. After these steps the database contains 24,458 unique non-empty classified calls.

We randomly split the database into training and test samples 20 times. The training samples contain 90 % of the calls and the test samples contain 10 % of the calls. For each training sample we designed a dictionary of the unique words which appear in that sample. The size of the dictionary varies from 3275 to 3329 words across the different splits.

3 Term Weighting Methods

After generating the training and test samples we performed term weighting. As a rule, a term weight is the product of two parts: one based on the term frequency in a document (TF) and one based on the term frequency in the whole database. The TF part is the same for all considered term weighting methods and is calculated as follows:

$$\begin{aligned} {TF}_{ij}=\log \left( {tf}_{ij}+1\right) ; \quad {tf}_{ij}= \frac{{n}_{ij}}{{N}_{j}}, \end{aligned}$$

where \(n_{ij}\) is the number of times the ith word occurs in the jth document and \(N_{j}\) is the document size (the number of words in the document).
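
For illustration, a minimal Python sketch of the TF part follows (the whitespace tokenization and the function name tf_part are our assumptions, not part of the original method):

```python
import math
from collections import Counter

def tf_part(document_tokens):
    """TF_ij = log(tf_ij + 1) with tf_ij = n_ij / N_j, computed for one document."""
    counts = Counter(document_tokens)     # n_ij: occurrences of each word in the document
    doc_size = len(document_tokens)       # N_j: number of words in the document
    return {word: math.log(count / doc_size + 1) for word, count in counts.items()}

print(tf_part("my internet is down down".split()))
```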

The second part of the term weighting is calculated once for each word from the dictionary and does not depend on the utterance being classified. We consider seven different methods for calculating this second part.

3.1 Inverse Document Frequency (IDF)

IDF is a well-known unsupervised term weighting method which was proposed in [17]. There are some modifications of IDF and we use the most popular one:

$$\begin{aligned} {idf}_{i}=\log \frac{\left| D \right| }{{n}_{i}}, \end{aligned}$$

where \(\left| D \right| \) is the number of documents in the training set and \(n_i\) is the number of documents that have the ith word.
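
A minimal sketch of the IDF computation, assuming documents are given as lists of tokens (the toy corpus is purely illustrative):

```python
import math

def idf_weights(documents):
    """idf_i = log(|D| / n_i), where n_i is the number of documents containing the ith word."""
    num_docs = len(documents)
    vocabulary = set(word for doc in documents for word in doc)
    return {word: math.log(num_docs / sum(1 for doc in documents if word in doc))
            for word in vocabulary}

docs = [["pay", "my", "bill"], ["internet", "is", "down"], ["bill", "question"]]
print(idf_weights(docs))
```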

3.2 Gain Ratio (GR)

Gain Ratio (GR) is mainly used in term selection [24], but in [4] it was shown that it could also be used for weighting terms, since its value reflects the importance of a term. The definition of GR is as follows:

$$\begin{aligned} GR\left( {t}_{i},{c}_{j} \right) =\frac{\sum _{c\in \{{c}_{j}, {\bar{c}}_{j} \}}\sum _{t\in \{{t}_{i}, {\bar{t}}_{i} \}}P\left( t, c \right) \cdot \log \frac{P\left( t,c \right) }{P\left( t \right) \cdot P\left( c \right) }}{-\sum _{c\in \{{c}_{j}, {\bar{c}}_{j} \}}P\left( c \right) \cdot \log P\left( c \right) }, \end{aligned}$$

where P(t, c) is the relative frequency with which a document contains the term t and belongs to the category c; P(t) is the relative frequency with which a document contains the term t; and P(c) is the relative frequency with which a document belongs to the category c.

Then, the weight of the term \(t_i\) is the maximum value over all categories:

$$\begin{aligned} GR\left( {t}_{i} \right) =\max _{{c}_{j}\in C}GR\left( {t}_{i},{c}_{j} \right) , \end{aligned}$$

where C is a set of all classes.
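
The following sketch computes GR from document-level presence/absence statistics; the representation of utterances as token sets and the toy corpus are our simplifying assumptions:

```python
import math

def gain_ratio(term, labels, docs):
    """GR(t) = max_c GR(t, c), using document-level presence/absence probabilities.
    docs: list of token sets; labels: list of class labels of the same length."""
    n = len(docs)
    has_t = [term in d for d in docs]
    p_t = sum(has_t) / n
    best = 0.0
    for cj in set(labels):
        in_c = [y == cj for y in labels]
        p_c = sum(in_c) / n
        numerator = 0.0
        for t_val, p_term in ((True, p_t), (False, 1 - p_t)):
            for c_val, p_class in ((True, p_c), (False, 1 - p_c)):
                p_tc = sum(1 for h, c in zip(has_t, in_c)
                           if h == t_val and c == c_val) / n
                if p_tc > 0 and p_term > 0 and p_class > 0:
                    numerator += p_tc * math.log(p_tc / (p_term * p_class))
        denominator = -sum(p * math.log(p) for p in (p_c, 1 - p_c) if p > 0)
        if denominator > 0:
            best = max(best, numerator / denominator)
    return best

docs = [{"pay", "bill"}, {"internet", "down"}, {"bill", "high"}, {"no", "internet"}]
print(gain_ratio("bill", ["billing", "tech", "billing", "tech"], docs))
```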

3.3 Confident Weights (CW)

The Confident Weights method (ConfWeight) is a supervised term weighting method that involves information about the classes of the documents. This approach was proposed in [21]. The main idea of the method is that the term t has a non-zero weight in the class c only if the frequency of the term t in the documents of class c is greater than its frequency in all other classes.

Firstly, the proportion of documents containing the term t is defined as the Wilson proportion estimate p(x, n) [22] by the following equation:

$$\begin{aligned} p\left( x,n \right) =\frac{x+0.5{z}^{2}_{\alpha /2}}{n+{z}^{2}_{\alpha /2}}, \end{aligned}$$

where x is the number of documents containing the term t in the given corpus, n is the number of documents in the corpus, and \(\varPhi \left( {z}_{\alpha /2} \right) =\alpha /2\), where \(\varPhi \) is the t-distribution (Student's law) when \(n<30\) and the normal distribution when \(n\ge 30\).

In this work \(\alpha = 0.95\) and \({z}_{\alpha /2}=1.96\) (as recommended by the authors of ConfWeight). For each term t and each class c two functions \(p_{pos} (x, n)\) and \(p_{neg} (x, n)\) are calculated. For \(p_{pos} (x, n)\), x is the number of documents which belong to the class c and contain the term t, and n is the number of documents which belong to the class c. For \(p_{neg} (x, n)\), x is the number of documents which contain the term t but do not belong to the class c, and n is the number of documents which do not belong to the class c.

The confidence interval \((p^-,p^+)\) at 95 % is calculated using the following equations:

$$\begin{aligned} p^-=p-{z}_{\alpha /2}\sqrt{\frac{p(1-p)}{n+{z}^{2}_{\alpha /2}}}; \quad p^+=p+{z}_{\alpha /2}\sqrt{\frac{p(1-p)}{n+{z}^{2}_{\alpha /2}}}. \end{aligned}$$

The strength of the term t in the category c is defined as follows:

$$\begin{aligned} str(t,c)={\left\{ \begin{array}{ll} \log _2\frac{2p^-_{pos}}{p^-_{pos}+p^+_{neg}}, &{} \mathrm {if} {\ p^-_{pos}>p^+_{neg}},\\ 0, &{} \mathrm {otherwise}. \end{array}\right. } \end{aligned}$$

The maximum strength (Maxstr) of the term \(t_i\) is calculated as follows:

$$\begin{aligned} Maxstr({t}_{i})=\max _{{c}_{j}\in C}str\left( {t}_{i},{c}_{j} \right) ^{2}. \end{aligned}$$

The ConfWeight method uses Maxstr as an analogue of IDF.
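
A sketch of the Wilson estimate, term strength and Maxstr follows; the function names, the toy corpus and the fixed value z = 1.96 for the 95 % level are illustrative assumptions:

```python
import math

Z = 1.96  # z_{alpha/2} for the 95% confidence level

def wilson_interval(x, n):
    """Wilson proportion estimate p(x, n) and its 95% confidence interval (p-, p+)."""
    p = (x + 0.5 * Z * Z) / (n + Z * Z)
    delta = Z * math.sqrt(p * (1 - p) / (n + Z * Z))
    return p - delta, p + delta

def max_strength(term, labels, docs):
    """Maxstr(t) = max_c str(t, c)^2, used by ConfWeight in place of IDF."""
    best = 0.0
    for cj in set(labels):
        pos = [d for d, y in zip(docs, labels) if y == cj]
        neg = [d for d, y in zip(docs, labels) if y != cj]
        p_pos_minus, _ = wilson_interval(sum(term in d for d in pos), len(pos))
        _, p_neg_plus = wilson_interval(sum(term in d for d in neg), len(neg))
        if p_pos_minus > p_neg_plus:
            strength = math.log2(2 * p_pos_minus / (p_pos_minus + p_neg_plus))
            best = max(best, strength ** 2)
    return best

docs = [{"pay", "bill"}, {"bill", "high"}, {"internet", "down"}, {"no", "internet"}] * 10
labels = ["billing", "billing", "tech", "tech"] * 10
print(max_strength("bill", labels, docs))
```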

The numerical experiments conducted in [21] have shown that the ConfWeight method outperforms Gain Ratio and TF-IDF with SVM and k-NN as classification methods on three benchmark corpora.

3.4 TM2 (Second Moment of a Term)

This supervised term weighting method was proposed in [23].

Let \(P(c_j|t)\) be the probability that a document belongs to the category \(c_j\) given that it contains the term t, and let \(P(c_j)\) be the probability that a document belongs to the category \(c_j\) without any conditions. The idea is the following: the more \(P(c_j|t)\) differs from \(P(c_j)\), the more important the term t is. Therefore, the term weight can be calculated as follows:

$$\begin{aligned} TM2(t_i)= \sum _{j=1}^{\left| C \right| }\left( P(c_j|t)-P(c_j) \right) ^2. \end{aligned}$$

where C is a set of all classes.
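
A possible implementation of TM2 under the same toy representation (utterances as token sets, labels as strings) is:

```python
def tm2_weight(term, labels, docs):
    """TM2(t) = sum over classes of (P(c_j | t) - P(c_j))^2."""
    n = len(docs)
    labels_with_term = [y for y, d in zip(labels, docs) if term in d]
    total = 0.0
    for cj in set(labels):
        p_cj = sum(1 for y in labels if y == cj) / n
        p_cj_given_t = (sum(1 for y in labels_with_term if y == cj) / len(labels_with_term)
                        if labels_with_term else 0.0)
        total += (p_cj_given_t - p_cj) ** 2
    return total

docs = [{"pay", "bill"}, {"bill", "high"}, {"internet", "down"}, {"no", "internet"}]
print(tm2_weight("bill", ["billing", "billing", "tech", "tech"], docs))
```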

3.5 Relevance Frequency (RF)

The RF term weighting method was proposed in [13] and is calculated as follows:

$$\begin{aligned} rf(t_i, c_j)=\log _2 \left( 2+\frac{a_j}{\max \{1, \bar{a_j}\}} \right) ;rf(t_i)=\max _{{c}_{j}\in C}rf\left( {t}_{i},{c}_{j} \right) , \end{aligned}$$

where \(a_j\) is the number of documents of the category \(c_j\) which contain the term \(t_i\) and \(\bar{a_j}\) is the number of documents of all the other categories which also contain this term.
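
A short sketch of RF under the same assumptions (utterances as token sets, a small illustrative corpus):

```python
import math

def relevance_frequency(term, labels, docs):
    """rf(t) = max_c log2(2 + a_c / max(1, a_c_bar))."""
    best = 0.0
    for cj in set(labels):
        a = sum(1 for d, y in zip(docs, labels) if y == cj and term in d)       # a_j
        a_bar = sum(1 for d, y in zip(docs, labels) if y != cj and term in d)   # a_j bar
        best = max(best, math.log2(2 + a / max(1, a_bar)))
    return best

docs = [{"pay", "bill"}, {"bill", "high"}, {"internet", "down"}, {"no", "internet"}]
print(relevance_frequency("bill", ["billing", "billing", "tech", "tech"], docs))
```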

3.6 Term Relevance Ratio (TRR)

The TRR method [11] uses tf weights and is calculated as follows:

$$\begin{aligned} TRR(t_i, c_j)=\log _2 \left( 2+\frac{P(t_i|c_j)}{P(t_i|\bar{c_j})} \right) ; \quad TRR(t_i)=\max _{c_{j}\in C}TRR\left( {t}_{i},{c}_{j} \right) , \end{aligned}$$
$$\begin{aligned} P(t_i|c)=\frac{\sum _{k=1}^{\left| T_c \right| }tf_{ik}}{\sum _{l=1}^{\left| V \right| }\sum _{k=1}^{\left| T_c \right| }tf_{lk}}, \end{aligned}$$

where \(c_j\) is a class of the document, \(\bar{c_j}\) denotes all classes other than \(c_j\), V is the vocabulary of the training data and \(T_c\) is the set of documents of the class c.
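
A sketch of TRR follows; utterances are represented as token lists, and the eps guard against division by zero when a term never occurs outside a class is our addition, not part of the original definition:

```python
import math
from collections import Counter

def term_class_prob(term, class_docs):
    """P(t | c): tf mass of the term within class c relative to the total tf mass of the class."""
    term_mass, total_mass = 0.0, 0.0
    for doc in class_docs:
        counts, size = Counter(doc), len(doc)
        for word, cnt in counts.items():
            tf = cnt / size
            total_mass += tf
            if word == term:
                term_mass += tf
    return term_mass / total_mass if total_mass else 0.0

def trr_weight(term, labels, docs, eps=1e-9):
    """TRR(t) = max_c log2(2 + P(t|c) / P(t|not c))."""
    best = 0.0
    for cj in set(labels):
        in_class = [d for d, y in zip(docs, labels) if y == cj]
        out_class = [d for d, y in zip(docs, labels) if y != cj]
        ratio = term_class_prob(term, in_class) / max(term_class_prob(term, out_class), eps)
        best = max(best, math.log2(2 + ratio))
    return best

docs = [["pay", "my", "bill"], ["bill", "too", "high"], ["internet", "down"], ["no", "internet"]]
print(trr_weight("bill", ["billing", "billing", "tech", "tech"], docs))
```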

3.7 Novel Term Weighting (NTW)

This method was proposed in [5, 6]. For each term we assign a real number (term relevance) that depends on the term frequency in the utterances. The term weight is calculated using a modified formula of fuzzy rule relevance estimation for fuzzy classifiers [9], where the membership function has been replaced by the word frequency in the current class.

The details of the procedure are the following. Let L be the number of classes; \(n_i\) is the number of documents which belong to the ith class; \(N_{ij}\) is the number of occurrences of the jth word in all documents from the ith class; \(T_{ij} = N_{ij} / n_i\) is the relative frequency of occurrences of the jth word in the ith class; \(R_j=\max _i T_{ij}\); and \(S_j=\arg \max _i T_{ij}\) is the class which we assign to the jth word. The term relevance \(C_j\) is calculated as follows:

$$\begin{aligned} C_j=\frac{1}{\sum _{i=1}^{L}T_{ij}}\cdot \left( R_j-\frac{1}{L-1}\cdot \sum _{i=1, i\ne S_j}^{L}T_{ij} \right) . \end{aligned}$$
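
A sketch of the NTW relevance for a single term; utterances are represented as token lists so that multiple occurrences are counted, and the toy data are illustrative:

```python
def novel_term_weight(term, labels, docs):
    """C_j = (R_j - sum_{i != S_j} T_ij / (L - 1)) / sum_i T_ij, with T_ij = N_ij / n_i."""
    classes = sorted(set(labels))
    L = len(classes)
    T = []
    for cj in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == cj]
        occurrences = sum(d.count(term) for d in class_docs)      # N_ij
        T.append(occurrences / len(class_docs))                   # T_ij = N_ij / n_i
    total = sum(T)
    if total == 0:
        return 0.0
    r = max(T)                                                    # R_j
    s = T.index(r)                                                # index of the assigned class S_j
    rest = sum(t for i, t in enumerate(T) if i != s)
    return (r - rest / (L - 1)) / total

docs = [["pay", "my", "bill"], ["bill", "too", "high"], ["internet", "down"], ["no", "internet"]]
print(novel_term_weight("bill", ["billing", "billing", "tech", "tech"], docs))
```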

4 Feature Extraction and Classification Algorithms

Dimensionality reduction decreases the processing time of a machine learning algorithm while keeping acceptable classification results. For some machine learning algorithms dimensionality reduction is critically important because they cannot be applied to problems with a dimensionality as large as the size of the dictionary of a text classification problem. There are two different approaches to dimensionality reduction: feature extraction and feature selection. With the first approach we generate a small number of new features from the original ones, e.g. by clustering terms [7]. With the second one we remove useless and non-informative features.

We propose a novel feature extraction method based on the assignment of terms to classes. The idea is to assign each term from the dictionary to the most appropriate class. Such an assignment is performed during the calculation of GR, CW, RF, TRR, and NTW. With TF-IDF and TM2 we can also assign one class to each term using the relative frequency of the word in the classes:

$$\begin{aligned} S_j=\arg \max _{c\in C} \frac{n_{jc}}{N_c}, \end{aligned}$$

where \(S_j\) is the most appropriate class for the jth term, c is an index of a class, C is a set of all classes, \(n_{jc}\) is the number of documents of the cth class which contain the jth term, \(N_c\) is the number of all documents of the cth class.

After assigning each word to one class and performing term weighting, we can calculate, for each class, the sum of the term weights in a document. These sums become the new features of the text classification problem. Therefore, the method reduces the dimensionality radically: the dimensionality of the classification problem equals the number of classes.
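
A sketch of this feature extraction step; the dictionaries term_class (the class assigned to each word) and term_weight (the second part of the term weighting) are assumed to come from one of the methods of Sect. 3:

```python
import math
from collections import Counter

def extract_features(doc_tokens, term_class, term_weight, classes):
    """Map a document to |C| features: sums of TF * term weight, grouped by the term's class."""
    features = {c: 0.0 for c in classes}
    counts = Counter(doc_tokens)
    doc_size = len(doc_tokens)
    for word, cnt in counts.items():
        if word in term_class:                         # words outside the dictionary are ignored
            tf = math.log(cnt / doc_size + 1)          # TF part from Sect. 3
            features[term_class[word]] += tf * term_weight[word]
    return [features[c] for c in classes]

# Illustrative call with hypothetical assignments and weights.
print(extract_features("my bill is too high".split(),
                       {"bill": "billing", "high": "billing", "internet": "tech"},
                       {"bill": 2.0, "high": 0.5, "internet": 1.5},
                       ["billing", "tech"]))
```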

We used k-NN with distance weighting as the classification algorithm with all seven term weighting methods. We varied k from 1 to 15.
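
A sketch of this classification step using scikit-learn's distance-weighted k-NN; the random arrays stand in for the class-sum features and labels and are purely illustrative (the paper itself uses RapidMiner, see below):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Toy placeholders: 20-dimensional class-sum features for a 20-class problem.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 20)), rng.integers(0, 20, 200)
X_test, y_test = rng.random((50, 20)), rng.integers(0, 20, 50)

for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)
    macro_f1 = f1_score(y_test, knn.predict(X_test), average="macro", zero_division=0)
    print(k, round(macro_f1, 3))
```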

After classification was performed with all term weighting methods, we used the classification results on the training sample (the class predicted by each term weighting method) as categorical features for the collective meta-classifier. We used rule induction [2] as the meta-classifier for the collectives of term weighting methods.
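
The paper uses RapidMiner's rule induction as the meta-classifier; as a rough stand-in, the sketch below trains a decision tree on one-hot encoded base predictions (the random predictions and labels are placeholders, and the decision tree is our substitute, not the rule induction operator of the paper):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy placeholders: rows are utterances, columns are the classes predicted by each of the
# seven term weighting pipelines on the training sample.
rng = np.random.default_rng(0)
base_predictions = rng.integers(0, 20, (200, 7))
y_true = rng.integers(0, 20, 200)

encoder = OneHotEncoder(handle_unknown="ignore")
X_meta = encoder.fit_transform(base_predictions)          # categorical -> one-hot features
meta_classifier = DecisionTreeClassifier(max_depth=5).fit(X_meta, y_true)
print(meta_classifier.predict(encoder.transform(rng.integers(0, 20, (5, 7)))))
```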

RapidMiner was used as the software for the k-NN and rule induction applications [19]. The classification criterion is the macro F-score.

5 Results of Numerical Experiments

First, we tested the different term weighting methods with different values of k on the test samples of the 20 separations of the database. The numerical experiments provide a ranking of the methods supported by statistical analysis (t-test). The ranking is presented in Table 2, which contains the results averaged over the 20 separations of the database.

Table 2 Ranking of the term weighting methods
Table 3 The results with collectives of the term weighting methods
Table 4 The comparison of the collectives with TRR
Fig. 1 The average F-score of the individual methods and the collectives depending on k

For the rule induction application we designed collectives with different numbers of included methods, from 7 down to 2, successively excluding the worst methods. For the collective with 3 methods we include RF despite the fact that there is no statistically significant difference between RF and TM2. Therefore, the collective 7 contains all 7 term weighting methods; the collective 6 contains TRR, CW, RF, TM2, NTW, and GR; the collective 5 contains TRR, CW, RF, TM2, and NTW; the collective 4 contains TRR, CW, RF, and TM2; the collective 3 contains TRR, CW, and RF; and the collective 2 contains TRR and CW. The results of the numerical experiments with the collectives of term weighting methods are presented in Table 3.

Table 4 contains a comparison of the collectives with the best term weighting method, TRR. The 2nd column contains the average increment of the max F-score, the 3rd column contains the number of database separations (out of 20) with an increment of the max F-score, and the 4th column contains the values of the t-test in comparison with TRR.

Figure 1 shows the average F-score of all of the individual methods and all of the collectives depending on k.

6 Conclusions and Future Directions

The investigations have shown that the collective of the two best term weighting methods (TRR and CW) provides a statistically significant increase of classification effectiveness in comparison with the best individual term weighting method (TRR) for natural language call routing. The collectives with more methods do not differ significantly from the best individual method (TRR). We can also conclude that the collectives are more effective than individual term weighting methods for higher values of k in k-NN.

For the next investigations we propose the following:

  • The application of different classification algorithms for the individual term weighting methods: for example, SVM or artificial neural networks.

  • The separation of the database into three samples. The first one will be used as a training sample for the learning of the individual term weighting methods, the second one will be used for the learning of the collectives, and the third one will be used as a test sample. In the current work we use the same training sample for the individual methods and for the collectives. The use of different samples for learning may increase the effectiveness of the collectives, especially for the k-NN algorithm with low values of k.

In future work, we are planning the following:

  • The application of different feature selection and feature extraction methods.

  • The application of different approaches for meta-classification.

  • The application of collectives of term weighting methods for different text classification problems and for different databases.