1 Introduction

Many dimensionality reduction techniques have been proposed in the literature. Dimensionality reduction converts higher dimensional data into lower dimensional data, with the objective of reducing both computation time and storage space. The traditional way to perform dimensionality reduction is principal component analysis (PCA), but PCA is computationally expensive for high dimensional data. Several other traditional methods also convert high dimensional data into data with lower dimensions. Machine learning approaches such as clustering and classification have recently been used in text mining applications to convert high dimensional data into smaller subsets and thereby increase computational efficiency (Bingham and Mannila 2001).

Text documents usually contain irrelevant and noisy features which prevent learning algorithms from producing good accuracies. Different data mining techniques can be applied to remove the unwanted data. Feature selection and feature extraction are two different techniques used in classifying the data (Abualigah et al. 2017). Applying text clustering for feature selection and classification of text data is one of the most widely used strategies at present. Feature selection eliminates unwanted text features so that text clustering and classification can be performed efficiently. Early research focused on converting high dimensional data to lower dimensional data using existing distance functions. Dimensionality reduction reduces the computation time and increases classification efficiency. Text retrieval and information retrieval are used to identify meanings and synonyms within a document (Berka and Vajteršic 2013). Many approaches have been proposed to perform clustering and classification tasks. Clustering has typically been performed using unsupervised methods with various forms of class label information (Bharti and Singh 2014).

Feature selection and feature extraction are the two techniques widely used in dimensionality reduction. Feature selection eliminates unused features and identifies representative attributes in the original feature space (He et al. 2008), whereas feature extraction maps the data from a high dimensional space to a low dimensional one. The most widely used technique for dimensionality reduction is PCA. Latent semantic indexing (LSI) and latent semantic analysis (LSA) are two related techniques used to handle high dimensional data and perform dimensionality reduction. LSA computes a low-rank approximation of the document-term matrix and is implemented using singular value decomposition (SVD), which is closely related to PCA (He et al. 2008). The locally linear embedding (LLE) algorithm is another method used to process high dimensional data efficiently.
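As a point of reference for this family of techniques (and not the method proposed in this paper), the following minimal sketch applies LSA-style reduction to a toy corpus by taking a truncated SVD of the TF-IDF document-term matrix; the corpus, the vectorizer settings and the number of components are arbitrary choices made only for illustration.

```python
# A minimal LSA-style sketch: TF-IDF document-term matrix reduced via truncated SVD.
# Illustrative only; the corpus and n_components are arbitrary choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stock markets fell on trade fears",
    "central bank raises interest rates",
    "new vaccine shows promising trial results",
    "team wins championship after late goal",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)            # sparse document-term matrix (4 x vocabulary size)

lsa = TruncatedSVD(n_components=2, random_state=0)
X_low = lsa.fit_transform(X)             # dense 4 x 2 matrix in the latent semantic space

print(X.shape, "->", X_low.shape)
```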

LLE preserves the local configurations of the data while mapping them into a low dimensional space. Although its space complexity is a concern, LLE is widely used to convert high dimensional data to lower dimensions (He et al. 2008). The most commonly used variant of LLE relies on the Euclidean distance measure. In data classification, the main focus has been on converting higher dimensions into lower dimensions using either PCA or linear discriminant analysis (LDA), the latter being a supervised learning technique. LDA is applicable only when the data/text can be assumed to follow a Gaussian distribution.

Clustering is an unsupervised learning technique for grouping similar entities together. The clustering process is challenging because several obstacles must be overcome to achieve accuracy and efficiency. Accuracy is challenging because no rule of thumb exists for deciding the correct number of clusters. Likewise, the challenge of achieving efficiency and cluster quality creates the need for designing new and better similarity measures.

In the existing literature, several similarity functions have been proposed for computing the similarity between two entities (Aggarwal 2007). However, most of these similarity measures are suitable for computing similarity in low dimensional data spaces. There is an immediate need for new and accurate similarity measures applicable to applications involving high dimensional data spaces. The important questions are:

  • What is the minimum dimensionality for a data space to be considered high dimensional? (VinayKumar et al. 2015)

  • Why do similarity measures that are suitable for low dimensional data spaces become unsuitable when moving to high dimensional data spaces? (VinayKumar et al. 2015)

  • How can noise be handled in a high dimensional data space?

  • How can the suitability of a similarity measure be evaluated for finding the similarity between objects defined over a high dimensional data space?

The present research contribution addresses the problem of dimensionality reduction by designing an appropriate similarity measure for classifying and clustering text data and text stream data.

Let ‘D’ denote a data object defined over a finite set of representative attributes. Any randomly chosen data object defined by fewer than 10 attributes is treated as low dimensional, and one defined by more than 10 attributes is treated as a high dimensional data object (Han et al. 2012b; VinayKumar et al. 2015). Clustering high dimensional data may be viewed as a search problem of finding the clusters and the spatial dimensions over which these clusters may be generated reliably (Aggarwal 2007; Han et al. 2012a). In Jiang et al. (2011c), the authors propose an approach for reducing the dimensionality of the document-word matrix using feature clustering and then classify test documents using the reduced dimension matrix. The authors in Lin et al. (2013) introduce a new similarity measure (SMTP) for clustering text documents and document sets. Information on the different algorithms and data stream models available in the literature is given in Aggarwal (2007), Babcock et al. (2002), Tatbul and Zdonik (2006) and Gaber et al. (2004). Chang and Lee (2005) discuss a sliding window based approach for finding frequent patterns in data streams. Various methods for clustering data streams are contributed in Charikar et al. (2003), Aggarwal et al. (2003), Gaber et al. (2005) and Phridviraj et al. (2014). A more detailed discussion of research issues in the clustering process is given in Sect. 2.

1.1 Need for Dimensionality Reduction

Dimensionality reduction is a key process which reduces the laborious computation involved in the clustering process. For instance, if we have 1000 dimensions, then the distance function requires computation over all 1000 dimensions, and every computation during the learning process must consider all of them. Another problem is the memory required. For example, if there are 10,000 documents and each document vector is a frequency vector defined over 1000 features, then the space required is that of 10,000,000 element values. Assuming that each element value requires 4 bytes, the space required equals 4 × 10^7 bytes, i.e. approximately 38.15 MB.
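The storage estimate above can be restated as a one-line calculation (the figures are the same as in the example):

```python
# Restates the storage estimate above: a dense frequency matrix of
# 10,000 documents x 1,000 features at 4 bytes per element.
n_docs, n_features, bytes_per_value = 10_000, 1_000, 4
total_bytes = n_docs * n_features * bytes_per_value
print(total_bytes, "bytes =", round(total_bytes / 1024 ** 2, 2), "MB")   # 40000000 bytes = 38.15 MB
```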

1.2 Motivation

In Jiang et al. (2011a), the basic Gaussian function is used for defining the membership function and for similarity computation; the membership function is product based. Motivated by Jiang et al. (2011b, c), Lin et al. (2014), Radhakrishna et al. (2016a, 2017a, c, e) and Aljawarneh et al. (2017b), a new membership function is proposed here. The difference between the membership function defined in Jiang et al. (2011a) and the proposed one is that the former is a product-based function whereas the proposed membership function is summation based.

1.3 Research Issues

1.3.1 Distance Function

Distance functions are always important in the clustering process, whose implicit operation is finding the distance between two cluster elements. Membership functions help quantify the degree of similarity between cluster elements. Existing distance functions suffer from problems such as sparseness, high dimensionality and sensitivity to large values, and they do not take the distribution behavior of the data into account (Jiang et al. 2011a, c; Radhakrishna et al. 2015, 2016b, c, d, e, 2017a, b, e, g, h, i, 2018; Aljawarneh et al. 2016, 2017a; Chen et al. 2015; Sammulal et al. 2017; Usha Rani et al. 2018; Usha Rani and Sammulal 2017; SureshReddy et al. 2014; VinayKumar et al. 2015).

1.3.2 Cluster Quality

Cluster quality is another important parameter to be considered. Quality clusters exhibit high cohesion among intra-cluster elements and low coupling between inter-cluster elements. Approaches such as the silhouette coefficient and the total sum of squares, TSS (WSS + BSS), may be used to evaluate cluster quality, as sketched below.
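A minimal sketch of one such quality check, computing the silhouette coefficient for a k-means partition of toy data (the data, the number of clusters and the library choice are illustrative assumptions):

```python
# Evaluating cluster quality with the silhouette coefficient (toy data, arbitrary k).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 5)),      # two well-separated blobs
               rng.normal(2.0, 0.3, (50, 5))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))  # closer to 1 => better separated clusters
```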

1.3.3 Dimensionality and Sparseness

Dimensionality is a problem that must be addressed before clustering can be performed (Gama 2013). High dimensionality introduces noise and outliers, and several methods for eliminating noisy and outlier data have been reviewed in the literature. Feature selection and feature reduction are two techniques that serve this requirement. Data sparseness is another key issue that hinders clustering, since it increases complexity. Sparseness refers to situations where a matrix contains many more zero than non-zero values. The computational process cannot simply discard sparse regions and must inevitably consider this data.

1.3.4 Feature Distribution

Retaining the feature distribution is important for clustering (Tsai et al. 2009; Aljawarneh and Vangipuram 2018) and other learning approaches (Neagoe and Neghina 2016; Hanneke 2016; Adeli et al. 2016). Data representation methods that preserve the distribution are important and help achieve good quality in the underlying clusters. Approaches for evaluating cluster quality are available in the literature.

Section 2 provides a detailed literature review; Sect. 3 introduces the proposed similarity function for feature clustering and dimensionality reduction, which is inspired by Lin et al. (2014) and Radhakrishna et al. (2017e); Sect. 4 analyses the similarity values; Sect. 5 gives a working example; Sect. 6 outlines the results and discussion; and Sect. 7 concludes the work.

2 Literature Review

Bingham and Mannila (2001) used random projection for dimensionality reduction of text and image data. Information retrieval over both noisy and noiseless data is used in applications that process such data. Although random projection is used for dimensionality reduction, Euclidean distances between the generated vectors are approximately preserved (up to scaling) in the reduced space (Bingham and Mannila 2001). PCA is optimal for data projection in the mean square sense. Another method used for text documents is the discrete cosine transform (Bingham and Mannila 2001). Although many researchers have performed dimensionality reduction using different techniques, computation time and efficiency also have to be considered. He et al. (2008) used the LLE algorithm to preserve local configurations by identifying nearest neighbors. Using LLE, reconstruction errors are claimed to be minimized with fixed reconstruction weights (He et al. 2008). One of the main drawbacks of LLE is that data points may be mapped too close together, and sample data are used to estimate the Euclidean structure.

Although many approaches are used, LLE-based dimensionality reduction can be viewed from different perspectives: linear versus non-linear, local versus global, and supervised versus unsupervised (He et al. 2008). Mallick and Bhattacharyya (2012) describe a maximum margin criterion for performing dimensionality reduction on the term-document matrix. The cosine similarity measure between document vectors is used to calculate the distance between two sample documents (Mallick and Bhattacharyya 2012). The maximum margin criterion (MMC) is computed using an optimal projection discriminant matrix, which leads to an uncorrelated local MMC. In Mallick and Bhattacharyya (2012), SVD and PCA are not used, and the method was found to be computationally efficient with promising results. The initial data reduction step consisted of stop word removal and stemming. The average recognition rate was comparatively good, at 98.3% for 6 classes.

One of the methods used by Pang et al. (2013) is class-centroid based dimensionality reduction, which showed promising results for text classification. The method is centroid based, with documents represented in a vector space model using term frequency and inverse document frequency to compute similarity against the reduced document matrix. Centroid-based dimensionality reduction proceeds in two stages, class centroid generation and class centroid projection, to identify similar documents. By applying these steps, Pang et al. (2013) obtained good results in producing lower dimensional data. As research moved towards dimensionality reduction with accurate results, Ganguly et al. (2015) proposed context-driven dimensionality reduction for clustering text documents. The approach effectively increases the efficiency of clustering documents in large, high dimensional datasets. Several evaluation metrics, and most often named entity recognition, are applied to the document matrix for dimensionality reduction to obtain the desired outputs efficiently (Ganguly et al. 2015). K-means clustering and hierarchical agglomerative clustering (HAC) are the traditional approaches applied to the term-document matrix to identify document similarity.

A parallel rare term vector replacement algorithm for dimensionality reduction was proposed in Berka and Vajteršic (2013). The approach was fast and effective for dimensionality reduction of text documents and gave promising results. The idea is to convert a highly sparse corpus matrix into a dense matrix to improve the efficiency of similarity computation. Rare term vectors (Berka and Vajteršic 2013) are used as feature vectors to determine whether a document contains a particular feature. Initially the document is scanned, rare elements are identified and eliminated, and then term replacement is applied, which was reported to give promising results (Berka and Vajteršic 2013). Truncation and elimination of unwanted features were performed for efficient computation. Parallelization was performed as a hybrid of task and data partitioning so that data could be retrieved using parallel processing (Berka and Vajteršic 2013). Although the study in Johnson and Wichern (2007) compares independent component analysis (ICA) with SVD and PCA, ICA relies on a linear approximation of the original data which is again based on PCA. Multidimensional scaling is another approach (Hyvarinen et al. 2004) for projecting data into a lower dimensional space for distance computation. Although many approaches have appeared, most of them are mainly based on SVD (Cox and Cox 2001). The approaches in Cox and Cox (2001), Bartell et al. (1992), Berka and Vajteršic (2011) and Paatero and Tapper (1994) represent dimensionality reduction algorithms using self-organization approaches.

Stop word removal and stemming are the initial stages of any document classification procedure. The data normalized after stemming is called the term matrix. After additional normalization, different dimensionality reduction techniques are applied to the documents, and the results are usually better with reduced dimensionality. Feature extraction and feature selection are the two main processes in classifying text data. One important approach, explained in Uguz (2012), is unsupervised dimension reduction for text clustering. Uguz (2012) used a hybrid method to create an informative reduced-dimensional feature subspace, combining filter-wrapper feature selection with feature extraction to convert high dimensional data into low dimensional data. The approaches used in Bharti and Singh (2014) and Uguz (2012) gave better results by applying a filter-wrapper for dimensionality reduction. Although several techniques were used in Uguz (2012), the result set covers only a specific feature subset. Unler et al. (2011) and Bharti and Singh (2014) proposed a filter-wrapper method that uses the mutual information available from the filter model to weight the bit selection probabilities in an SVM-based algorithm; however, the algorithm fails to remove noise from the datasets. Therefore, in Bharti and Singh (2014), a three-stage unsupervised dimension reduction was used to give the best results and select the data accordingly.

As many approaches have evolved, one recent approach, explained in Xu et al. (2018), applies classical dimensionality reduction and sample selection methods to large-scale data. Classical machine learning approaches such as clustering and random forest algorithms have been applied to identify the best results. Deep learning and other approaches were also used to perform dimensionality reduction on large datasets (Xu et al. 2018). Processing large datasets is a non-trivial task, and achieving efficiency is always a challenge.

3 Similarity Function for Feature Clustering

The similarity function for feature clustering is described in this section. The proposed similarity function is given by Eq. (1), with the auxiliary function \(F\) defined in Eq. (2):

$$Sim\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{F\left( {\alpha_{i} ,\alpha_{j} } \right) + \varphi }}{\varphi + 1}$$
(1)
$$F\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}}{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}}$$
(2)

where

$${\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right) = \left\{ {\begin{array}{*{20}l} {e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} ;} \hfill &\quad { \alpha_{ih} \ne 0\,and\,\alpha_{jh} \ne 0} \hfill \\ { - \varphi ;} \hfill &\quad { either\,\alpha_{ih} \,or\,\alpha_{jh} \,is\,0} \hfill \\ {0;} \hfill &\quad {both\,\alpha_{ih} ,\,\alpha_{jh} \,are\,0} \hfill \\ \end{array} } \right.$$
$${\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right) = \left\{ {\begin{array}{*{20}l} {0;} \hfill &\quad {both\,\alpha_{ih} \,and\,\alpha_{jh} \,are\,0} \hfill \\ {1;} \hfill &\quad {else} \hfill \\ \end{array} } \right.$$

In the similarity function, the variables \(\alpha_{ih} ,\alpha_{jh}\) are probability values. For instance, \(\alpha_{ih}\) and \(\alpha_{jh}\) denote the probabilistic chances that the words \(w_{i}\) and \(w_{j}\), respectively, belong to a given class label h.

The function \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is the ratio of the sums of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) and \({\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\). The parameter \(\varphi\) is a constant, and the value of \(\varphi\) that fits best is 1. The highest possible value of \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is 1 and the lowest possible value is \(- \varphi\); as \(\varphi\) is set to 1, the lowest possible value is − 1.
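For concreteness, a direct Python transcription of Eqs. (1) and (2) might look as follows. The defaults \(\sigma = 1\) and \(\varphi = 1\) and the example word profiles are illustrative assumptions; the paper only prescribes \(\varphi = 1\) as the best-fitting value.

```python
# Direct transcription of Eqs. (1)-(2); sigma and the example vectors are illustrative.
import numpy as np

def similarity(alpha_i, alpha_j, sigma=1.0, phi=1.0):
    """Sim(alpha_i, alpha_j) of Eq. (1); inputs are probability vectors over the m class labels."""
    g_sum, h_sum = 0.0, 0.0
    for a, b in zip(alpha_i, alpha_j):
        if a == 0 and b == 0:                          # G = 0 and H = 0: component contributes nothing
            continue
        h_sum += 1.0                                   # H = 1 for every other component
        if a != 0 and b != 0:
            g_sum += np.exp(-((a - b) / sigma) ** 2)   # Gaussian agreement term
        else:
            g_sum += -phi                              # exactly one of the two values is zero
    if h_sum == 0:
        raise ValueError("both vectors are entirely zero; F is undefined")
    f = g_sum / h_sum                                  # F(alpha_i, alpha_j), Eq. (2)
    return (f + phi) / (phi + 1.0)                     # Eq. (1)

# Example: probability profiles of two words over m = 4 class labels.
w1 = [0.6, 0.2, 0.0, 0.2]
w2 = [0.5, 0.3, 0.0, 0.2]
print(similarity(w1, w2))                              # close to 1: very similar profiles
```

With \(\varphi = 1\), the returned value always lies in [0, 1], in line with the bounds derived in the next section.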

4 Analysis of Similarity Values

The similarity values attained using the proposed function are analysed in the next three subsections. For the analysis, three cases are considered: (a) the similarity value in the worst case, (b) the similarity value in the best case and (c) the similarity value in the average case.

4.1 Worst Case

In the worst case, each component of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) is equal to \(- \varphi\). The function \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is computed as the ratio of the sums of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) and \({\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\).

$$\begin{aligned} F\left( {\alpha_{i} ,\alpha_{j} } \right) & = \frac{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}}{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}} \\ & = \frac{ - \varphi - \varphi - \varphi \cdots m\,times}{1 + 1 + 1 + \cdots m\,times} \\ & = - \varphi \\ \end{aligned}$$

The resulting value of \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is equal to \(- \varphi\) in the worst case.

So, the similarity value in the worst case is obtained as 0:

$$Sim\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{F\left( {\alpha_{i} ,\alpha_{j} } \right) + \varphi }}{\varphi + 1} = \frac{ - \varphi + \varphi }{\varphi + 1} = 0$$

This proves that \(Sim\left( {\alpha_{i} ,\alpha_{j} } \right)\) has a lower bound.

4.2 Best Case

In the best case, each component of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) is equal to 1. The function \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is computed as the ratio of the sums of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) and \({\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\).

$$\begin{aligned} F\left( {\alpha_{i} ,\alpha_{j} } \right) & = \frac{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}}{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}} \\ & = \frac{1 + 1 + 1 + \cdots m\,times}{1 + 1 + 1 + \cdots m\,times} = 1 \\ \end{aligned}$$

The resulting value of \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is equal to \(1\) in the best case.

So, the similarity value in the best case is obtained as 1:

$$Sim\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{F\left( {\alpha_{i} ,\alpha_{j} } \right) + \varphi }}{\varphi + 1} = \frac{1 + \varphi }{\varphi + 1} = 1$$

This proves that \(Sim\left( {\alpha_{i} ,\alpha_{j} } \right)\) has an upper bound.

4.3 Average Case

In the average case, some components of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) are given by the exponential function, some components are equal to \(- \varphi\), and the remaining components are zero. The function \(F\left( {\alpha_{i} ,\alpha_{j} } \right)\) is computed as the ratio of the sums of \({\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\) and \({\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)\).

$$\begin{aligned} F\left( {\alpha_{i} ,\alpha_{j} } \right) & = \frac{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{G}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}}{{\mathop \sum \nolimits_{h = 1}^{h = m} {\mathcal{H}}\left( {\alpha_{ih} ,\alpha_{jh} } \right)}} \\ & = \frac{{e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} + e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} + \cdots A\,times + - \varphi - \varphi \cdots B\,times }}{1 + 1 + \cdots A\,times + 1 + 1 + \cdots B\,times} \\ \end{aligned}$$

Writing \(A\) for the number of components in which both values are non-zero and \(B\) for the number in which exactly one value is zero (components where both values are zero contribute to neither sum), and treating the exponential term as a representative value, the above expression can be rewritten as

$$F\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{A*e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} - {\text{B*}}\varphi }}{{\left( {A + B} \right)}}$$

So, the expression for similarity value in average case is derived as

$$\begin{aligned} Sim\left( {\alpha_{i} ,\alpha_{j} } \right) & = \frac{{\frac{{A*e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} - {\text{B*}}\varphi }}{{\left( {A + B} \right)}} + \varphi }}{\varphi + 1} \\ {\text{i}} . {\text{e}} .\quad Sim\left( {\alpha_{i} ,\alpha_{j} } \right) & = \frac{{A \left( {e^{{ - \left( {\frac{{\alpha_{ih} - \alpha_{jh} }}{\sigma }} \right)^{2} }} + \varphi } \right)}}{{\left( {A + B} \right)\left( {1 + \varphi } \right)}} \\ \end{aligned}$$

The analysis proves that the similarity function has tight upper and lower bounds: the upper bound is one and the lower bound is zero.
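As an illustrative numeric check (the values are chosen arbitrarily and are not taken from the experiments): with \(\varphi = 1\), \(A = 2\), \(B = 1\) and the exponential term evaluating to 0.8,

$$Sim\left( {\alpha_{i} ,\alpha_{j} } \right) = \frac{{2\left( {0.8 + 1} \right)}}{{\left( {2 + 1} \right)\left( {1 + 1} \right)}} = \frac{3.6}{6} = 0.6,$$

which indeed lies between the lower bound 0 and the upper bound 1.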

5 Text Clustering Process

The incremental method is another method used for clustering data. In this method, the first word pattern is chosen as the mean and becomes the element of the first cluster. Each subsequent word pattern is compared with the existing cluster means. If the similarity between the word pattern and the most similar cluster mean is greater than a given threshold, the word pattern is placed into that cluster and the cluster mean is recomputed. If the similarity is less than the threshold, the word pattern is placed into a new cluster. For each word pattern, we therefore find its similarity with the existing clusters and place it in the cluster to which it is most similar. In this way, a finite set of clusters is generated. This method is better than k-means in that a new word pattern can be added to the existing clusters simply by computing its similarities with all the existing cluster means.

The step-by-step algorithm flow is as follows:

(Algorithm figure a: incremental clustering of word patterns.)
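The incremental procedure described above can be sketched in Python as follows; the similarity function passed in is assumed to be Sim of Eq. (1), and the threshold is a user-chosen parameter. This is an illustrative sketch, not the exact implementation used in the experiments.

```python
# Incremental clustering of word patterns as described above.
# `similarity` is assumed to be Sim of Eq. (1); `threshold` is user chosen.
import numpy as np

def incremental_cluster(patterns, similarity, threshold):
    means = [np.asarray(patterns[0], dtype=float)]           # first pattern seeds the first cluster
    members = [[0]]
    for idx in range(1, len(patterns)):
        p = np.asarray(patterns[idx], dtype=float)
        sims = [similarity(p, mean) for mean in means]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:                          # join the most similar existing cluster
            members[best].append(idx)
            cluster = [np.asarray(patterns[i], dtype=float) for i in members[best]]
            means[best] = np.mean(cluster, axis=0)           # recompute that cluster's mean
        else:                                                # otherwise start a new cluster
            means.append(p)
            members.append([idx])
    return means, members
```

For word patterns represented as probability vectors over the class labels, a call such as incremental_cluster(patterns, similarity, threshold=0.9) groups the words into feature clusters; the threshold value 0.9 is purely illustrative.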

5.1 Overview of Dimensionality Reduction Process

(Figure b: word distribution before and after dimensionality reduction.)

The pictorial demonstration shows that the word distribution before and after dimensionality reduction remains the same. It should also be noted that the proposed approach neither loses any information nor adds any noisy data. Hence, the reduced low dimensional matrix is a suitable form for performing clustering and classification tasks. The approach followed for clustering word patterns is motivated by Jiang et al. (2011b). In our approach to feature clustering, we use the similarity function introduced in this paper.
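The reduction step can be pictured as multiplying the document-term matrix by the hard word-cluster assignment matrix, so that each reduced feature is the sum of the frequencies of the words placed in one cluster. The sketch below illustrates this idea on invented toy numbers only; since per-document totals are unchanged, the word distribution over documents is preserved.

```python
# Reducing a document-term matrix with a hard word-cluster assignment matrix.
# The numbers are invented for illustration only.
import numpy as np

D = np.array([[2, 0, 1, 3],       # 3 documents x 4 words (term frequencies)
              [0, 1, 0, 2],
              [4, 0, 2, 0]])

M = np.array([[1, 0],             # 4 words x 2 word clusters: one 1 per row (hard assignment)
              [0, 1],
              [1, 0],
              [0, 1]])

D_reduced = D @ M                 # 3 documents x 2 cluster features
print(D_reduced)

# Per-document totals are unchanged, so the overall frequency distribution is preserved.
print(D.sum(axis=1), D_reduced.sum(axis=1))
```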

6 Results

This section gives the experimental results obtained using the proposed approach. Figure 1 plots the number of features before dimensionality reduction (DR). The numbers of documents considered are 200, 300, 400, 500, 600 and 700, with 2581, 2836, 3200, 3020, 3456 and 3263 features respectively.

Fig. 1 Number of features after initial pre-processing

Figure 2 plots the number of features after dimensionality reduction using information gain followed by SVD, and using the proposed approach for dimensionality reduction.

Fig. 2 Number of features after dimensionality reduction using two approaches

The number of features obtained using feature clustering based DR is less than that obtained using information gain followed by SVD. Another important advantage of the proposed approach is that the word distributions in the documents before and after DR are the same; that is, the documents are transformed from one form to another such that the distributions before and after DR coincide.

Figure 3 shows that the number of features obtained using feature clustering based DR is less than that obtained using information gain (IG) alone and IG followed by SVD. The documents considered are a random sample of 100, 200, 300, 400, 500 and 600 documents from the R8 dataset.

Fig. 3 Comparison of dimensionality reduction of various approaches

Classifier accuracies for 350 documents with 268 features after dimensionality reduction using the proposed approach are computed for the Euclidean and cosine distance measures and compared with the accuracies obtained using the proposed measure. As depicted in Fig. 4, the accuracy using the proposed measure is better than with the Euclidean and cosine measures.

Fig. 4 Classifier accuracies

Figure 5 shows that the number of features obtained using feature clustering based DR is less than that obtained using IG alone and IG followed by SVD. The documents considered are a random sample of 100, 200, 300, 400, 500, 600 and 800 documents from the R8 dataset.

Fig. 5 Comparison of dimensionality reduction for random sample

Figure 6 shows that the number of features obtained using feature clustering based DR is less than that obtained using IG alone and IG followed by SVD. The documents considered are a random sample of 150, 250, 350, 450 and 550 documents from the R58 dataset.

Fig. 6 Comparison of dimensionality reduction for random sample

Figure 7 shows that the number of features obtained using feature clustering based DR is less than that obtained using IG alone and IG followed by SVD. The documents considered are a random sample of 150, 250, 350, 450, 550 and 650 documents from the R58 dataset.

Fig. 7 IG versus proposed DR approach

Figure 8 plots the number of features before and after dimensionality reduction carried out by IG, SVD and the proposed approach for a total of 500 text documents from the trade class of the Reuters dataset. The dimensionality obtained using the proposed approach is significantly lower than that of the other approaches, and the distribution attained using the proposed method is the same as the distribution of words in the documents before dimensionality reduction.

Fig. 8 IG versus proposed DR approach

Figure 9 shows the classifier accuracies on the trade class of the Reuters dataset for 500 randomly chosen text documents. The accuracies of a kNN classifier for k = 3 to k = 15 are obtained using the Euclidean, SMTP (Lin et al. 2014) and proposed measures. From the experiments conducted, it can be concluded that the proposed approach yields lower dimensionality while retaining the word distribution of the documents.

Fig. 9 Classifier accuracies of trade class of Reuters dataset for randomly chosen 500 documents

7 Conclusions

Feature representation and dimensionality reduction are two important tasks in text clustering and classification. In this paper, an approach for feature representation and dimensionality reduction of text documents is described. The introduced feature representation and dimensionality reduction approaches retain the distribution of features. The output of feature representation is a hard representation matrix, which is used to obtain the low-dimensional document matrix; this low-dimensional matrix is the input for clustering. The working of the proposed approach is explained using a case study that demonstrates the importance of the approach and the advantage of dimensionality reduction. Experimental results confirm the effectiveness of the proposed approach for dimensionality reduction. A substantial amount of dimensionality reduction is achieved, while the feature distribution after dimensionality reduction is retained with respect to the documents before reduction. This makes the process important and allows better classification accuracies to be obtained. As a future extension, there is scope to devise new membership functions and apply them to clustering in order to obtain clusters of good quality and, subsequently, better classification accuracies.