1 Introduction

One of the most critical tasks in knowledge discovery is data clustering. According to Jain's definition, "The objective of data clustering, also known as cluster analysis, is to discover the natural grouping(s) of a set of patterns, points, or objects" [1]. Data clustering has applications in many different fields. For instance, in computer vision, image segmentation can be formulated as a clustering problem [2]. In information retrieval, document clustering can provide hierarchical retrieval and improve the performance of flat retrieval [3]. In bioinformatics, clustering is used to improve multiple sequence alignment [4]. Many other important applications exist in fields such as medicine and online social networks [1].

Document clustering techniques have been receiving more and more attention as an important and enabling tool for the efficient organization, navigation, retrieval, and summarization of huge volumes of text documents. Using good document clustering methods, computers can automatically organize a document corpus into a meaningful cluster hierarchy, which enables efficient browsing and navigation of the corpus. Alongside corpus maintenance, the most suitable stemming algorithm is applied; the stemming algorithms considered here are the Iterated Lovins stemmer, the Lovins stemmer, and the Porter stemmer. Stemming aims to reduce words to their roots. Efficient document browsing and navigation is a valuable complement to the deficiencies of traditional IR technologies.

Furthermore, this paper examines the performance of document clustering with a focus on stemming algorithms. The related work reviews the existing algorithms for document clustering and stemming, followed by the problems that arise when using LSI (Latent Semantic Indexing), SVD (Singular Value Decomposition), and PCA (Principal Components Analysis). The proposed work gives a detailed description of the document clustering and stemming algorithms. Finally, a result analysis is performed on the Newsgroup20 dataset by comparing the three stemming algorithms using the ICF, WSF, and CSWF factors for stemmed words.

2 Related Work

Clustering is the task of assessing the similarities and dissimilarities among observed phenomena and separating them into meaningful subgroups that share common characteristics. If the resulting classification places items with the relevant features into the same groups, that is a sign that a good clustering method has been adopted. Since clustering is an unsupervised method, the groups cannot be asserted in advance. Put simply, 'document clustering' is the grouping of similar documents. According to Guduru [5], conventional document clustering techniques use a set of words as features to measure the similarity between documents. These words are assumed to be mutually independent, which may not be the case in real applications. The conventional vector space model (VSM) uses words to describe documents, whereas in reality it is the concepts/semantics/features/topics that characterize them; the extracted features capture the most important ideas in the documents. Feature extraction has been used successfully in text mining with unsupervised algorithms such as Principal Components Analysis (PCA), Singular Value Decomposition (SVD), and Non-negative Matrix Factorization (NMF), all of which factorize the document-term matrix.

Berry et al. [6] and Landauer et al. [7] describe Latent Semantic Indexing (LSI), a novel information retrieval technique designed to address the flaws of the classic VSM. To rectify the faults of lexical matching, LSI uses statistically derived concepts instead of isolated word retrieval; it assumes a latent structure in word usage that is partially obscured by variability in word choice. A truncated Singular Value Decomposition (SVD) is used to estimate the structure of word usage across documents, and retrieval is performed against the database of singular value vectors. Performance data show that these statistically derived vectors are more robust indicators of meaning than individual terms; applications of LSI, with results, can be found in [6, 7]. SVD is used extensively as the standard factorization of the data matrix, but because SVD vectors contain negative values, unlike VSM vectors, which contain only positive values, they are difficult to interpret. These issues are addressed by NMF (formulated by Lee and Seung [8, 9]), which has several advantages over standard PCA/SVD; in particular, the non-negativity in NMF ensures coherent, parts-based representations of the original data (text, images, etc.) [10].

Ding et al. [10] demonstrated that when the Frobenius norm is used as the divergence and an orthogonality constraint H^T H = I is added, NMF is equivalent to a relaxed form of K-means clustering. Xu et al. [11] were the first to use NMF for document clustering, adding a unit Euclidean distance constraint on the column vectors; Yang et al. [12] extended this work by adding sparsity constraints, since sparsity is a critical feature of huge data in semantic space. In both works, the clusters were read off directly from the factor matrices. As in Xu et al. [11], the interpretation of the two non-negative matrices U and V is analogous to that of SVD: each element uij of matrix U represents the degree to which term fi ∈ W belongs to cluster j, and each element vij of matrix V represents the degree to which document i is associated with cluster j. If document i belongs solely to cluster x, then vix takes a large value while the remaining elements of the ith row vector of V take small values close to zero. From the work of Kanjani [13] it is seen that the accuracy of the algorithm of Lee and Seung [9] is higher than that of its derivatives [11, 12]; in that work he adopted the multiplicative update rules proposed by Lee and Seung [9]. Porter [14] presented a simple algorithm for stemming English words that has been widely adopted, with extensions, as the standard approach to word conflation in information retrieval.
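For reference, these methods build on the following formulation: NMF approximates the non-negative term-document matrix V by two non-negative factors, V ≈ WH, by minimizing the Frobenius objective, and Lee and Seung's multiplicative update rules [9] iterate

$$ \min_{W \ge 0,\, H \ge 0} \left\| V - WH \right\|_{F}^{2}, \qquad H_{aj} \leftarrow H_{aj} \frac{(W^{T}V)_{aj}}{(W^{T}WH)_{aj}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(VH^{T})_{ia}}{(WHH^{T})_{ia}} $$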

The Lovins stemming algorithm, proposed by Lovins [15], is likewise used mainly in the analysis of the English language and is the natural reference point for the Porter stemmer. Lovins's work gave the Porter stemmer the scope to shape its own algorithm with ample modifications and refinements. According to Laxmi [16], K-means is an unsupervised learning algorithm that addresses the well-known clustering problem: it assumes 'K' clusters and classifies the given grid processors accordingly in a straightforward way. Because the result depends on the initial centroids, K-means clustering cannot guarantee an optimal solution; the proposed system therefore uses partitional clustering (K-Centroids clustering). In the continued work of Laxmi [16], a heterogeneous cluster environment is created along with resource properties such as resource type, processing speed, and memory. To avoid scheduling delay, the system forms clusters using K-Centroids clustering, and nodes move to clusters according to their priorities [17]. Document clustering and a detailed description of KNMF, with a parallel explanation of document indexing, are studied in [18].

3 Problem Statement

In this gigantic world, we are overloaded with enormous numbers of files and documents in every related field. This overload forces users to devise their own strategies for analyzing any particular group of documents. Even when documents all belong to the same field, various subgroups are present, so to distribute the files into subgroups we need to understand their content and then separate them accordingly. Manual separation is feasible for collections of 10 to 100 files, but when the collection is huge, manual distribution is not enough. Computer-aided processing is therefore required to subgroup the files based on both file name and content.

Many scholars have carried out research in text mining related to document clustering and have produced increasingly efficient results over the years. Among these works, document clustering with the updated NMF rules proposed by Lee and Seung gives good performance. Before NMF, Latent Semantic Indexing (LSI), Singular Value Decomposition (SVD), and Principal Components Analysis (PCA) were in wide use; of these, LSI estimates the structure of word usage with the help of a truncated SVD. As time passes, new innovations with extra features replace the old ones, and the same has happened in document clustering: the problem has been revisited with Lee and Seung's updated NMF rules, which perform better than LSI. However, these methods have remained largely limited to academic use.

4 Proposed Work

A new updated model, KNMF, is used for automatic document clustering in place of Lee and Seung's NMF. For the experimental implementation, the Newsgroup20 dataset is used. This model is more efficient than the NMF proposed by Lee and Seung: because a K-means factor is added to NMF, it gives prominence to clustering with the extracted features.

Extracted features play a dominant role in the clustering of documents. These features are extracted based on requirements we define before classification; they can be semantics, topics, or featured words from the given document. These selected features condense the documents to their emphatic words, which makes clustering tractable. Achieving this requires a sequence of steps. Initially, stopwords and common clutter are removed with the help of the keyword list from the Key Phrase Extraction algorithm, and stemming is performed with the proposed Iterated Lovins stemmer algorithm. Later, the K-means clustering algorithm is implemented in parallel on the Hadoop MapReduce framework.

This paper elaborates the first portion of the implementation, which frees the required documents from unwanted clutter and evaluates their TF-IDF (Term Frequency-Inverse Document Frequency) count values for the clustering process, along with the novel stemming algorithm.
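As a minimal sketch (not the project's exact implementation), the TF-IDF value of a term t in a document can be computed from in-memory term counts as follows; the class and method names here are illustrative:

import java.util.List;
import java.util.Map;

// Illustrative TF-IDF computation over in-memory term counts.
public class TfIdfSketch {

    // tf-idf of term t in document doc, where each document is a map
    // from term to raw count and corpus is the list of all documents
    static double tfIdf(String t, Map<String, Integer> doc,
                        List<Map<String, Integer>> corpus) {
        int total = 0;
        for (int c : doc.values()) total += c;              // total terms in doc
        double tf = doc.getOrDefault(t, 0) / (double) total;

        long df = corpus.stream()                           // documents containing t
                        .filter(m -> m.containsKey(t))
                        .count();
        double idf = Math.log(corpus.size() / (1.0 + df));  // smoothed idf
        return tf * idf;
    }
}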

In the proposed KNMF system, documents are clustered according to the similarity between individual documents and the extracted features. In the KNMF computation, the features form the basis vectors W = {w1, w2, w3, ..., wx}, and the term-document matrix holds the documents V = {d1, d2, ..., di}; a document di is assigned to the basis vector wx for which the angle between di and wx is smallest.
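A minimal sketch of this assignment rule, assuming documents and basis vectors are represented as dense double[] arrays (illustrative names):

// Cosine-based assignment of a document vector d to the closest
// basis vector in W (largest cosine = smallest angle).
public class CosineAssign {

    static double cosine(double[] d, double[] w) {
        double dot = 0, nd = 0, nw = 0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * w[i];
            nd += d[i] * d[i];
            nw += w[i] * w[i];
        }
        return dot / (Math.sqrt(nd) * Math.sqrt(nw));
    }

    // returns the index x of the basis vector wx closest to d
    static int assign(double[] d, double[][] W) {
        int best = 0;
        for (int x = 1; x < W.length; x++) {
            if (cosine(d, W[x]) > cosine(d, W[best])) best = x;
        }
        return best;
    }
}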

5 The Methodology Adopted

  1. Formulate the term-document matrix V using term frequency-inverse document frequency for the groups of file folders.

  2. Normalize each column of V to unit Euclidean length (a sketch follows this list).

  3. Compute W and H using the update rules of Lee and Seung's NMF and of KNMF.

  4. Apply cosine similarity to measure the distance between each document di and the defined features W.

  5. As in a single pass of K-means, assign di to the basis vector wx with which it makes the smallest angle (Fig. 1).
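A minimal sketch of the normalization in step 2, assuming V is stored as a dense double[][] array (illustrative, not the project's exact code):

// Scale each column of the term-document matrix V to unit Euclidean length.
static void normalizeColumns(double[][] V) {
    int rows = V.length, cols = V[0].length;
    for (int j = 0; j < cols; j++) {
        double norm = 0;
        for (int i = 0; i < rows; i++) norm += V[i][j] * V[i][j];
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < rows; i++) V[i][j] /= norm;
        }
    }
}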

Fig. 1 Flowchart of the proposed methodology

Hadoop is initialized in both local (standalone) mode and pseudo-distributed mode to run the parallel versions of K-means, which are submitted through the JobClient. The time analysis measurements are recorded for future reference.

5.1 Steps for Initial Progressive Updating of the Documents in a Folder

  1. Determine whether the document is new; if it is new, update the index.

  2. If it is a new document, create a text document for it and update the index, replacing the old entry.

  3. Using the Key Phrase Extraction algorithm, which defines 499 stopwords, the stopwords in the documents are identified. The stopword list is maintained as a text document, which makes it easy to modify.

  4. The stopwords are then read and removed from the document. The user can also add words to the stopword text document.

  5. The stemming algorithm is applied to the given document.

  6. The created text document is stored in the index.

  7. Finally, stray files (documents that were eliminated from the group set but are still found in it) are removed.

Figure 2 shows the flow diagram giving a detailed description of module 1, the part of the whole process presented in this paper. Module 2 describes the calculation of the Term Frequency-Inverse Document Frequency (TF-IDF) value for the given input documents. A block diagram of the prescribed process for modules 1 and 2 is given in Fig. 3.

Fig. 2 Flow chart representing the indexing of documents

Fig. 3 Block diagram for modules 1 and 2

5.2 Stopwords Count Algorithm

  • Step 1: The input text document is split into individual words, which are stored in an array.

  • Step 2: Each stopword is read from the defined stopword list.

  • Step 3: Each word is compared with the stopword list using a sequential array search.

  • Step 4: A word is removed from the text document if it matches a word in the stopword list.

  • Step 5: This process continues over the whole text document and produces a stopword-free text document.

In code, the stopword counter can be written as a Hadoop mapper:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // the 499 stopwords, loaded from the stopword text document
    static String[] stopWords = {};

    // defining the stopWords set for fast lookup
    static Set<String> stopWordsSet = new HashSet<String>();
    static {
        // adding the stopwords list to the set
        for (String s : stopWords) {
            stopWordsSet.add(s);
        }
    }

    // counters are defined to count stopwords
    static enum Counters { STOPWORDS }

    // mapper phase for identifying stopwords in the given text
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String str = value.toString();
        // define a new string tokenizer over str
        StringTokenizer tokens = new StringTokenizer(str);
        while (tokens.hasMoreTokens()) {
            // each token is assigned to the string 'word'
            String word = tokens.nextToken();
            // compare the word with the stopword set
            if (stopWordsSet.contains(word)) {
                // on a match, increment the counter value by 1
                context.getCounter(Counters.STOPWORDS).increment(1);
            }
        }
    }
}

5.3 Lovins Stemmer Algorithm

The algorithm consists of three steps:

Step 1:

Suffixes are stripped, and the resulting stems are retained for further treatment. Suffix stripping matches the ending of the word against the longest applicable suffix from a list of 294 suffixes, governed by 29 defined rules.

Step 2:

Next comes the recoding phase, in which the retained stems are cleaned up for linguistic exceptions such as doubled-letter endings of 'd' and 't'. This phase applies 35 rules to the stem endings.

Example: 'Outputting' is stemmed to 'Output' rather than 'Outputt'.

Step 3:

Finally, conflation is corrected with a partial-matching algorithm. This raises the conflation level by recognizing that two stems can be effectively equal even though they differ slightly as a side effect of the suffix-stripping step.

Example: 'EXPLAIN' and 'EXPLANATION' are stemmed distinctly, to 'EXPLAIN' and 'EXPLAN'.
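To illustrate the longest-match principle of step 1, here is a sketch with a tiny suffix list standing in for the full 294-entry Lovins list (the 29 context rules are omitted):

// Illustrative longest-match suffix stripping in the Lovins style.
public class LovinsSketch {

    // tiny illustrative subset of suffixes, ordered longest first
    static final String[] SUFFIXES = { "ational", "ation", "ing", "s" };

    static String strip(String word) {
        for (String suf : SUFFIXES) {
            // the first (longest) matching suffix wins; keep a stem of
            // at least two characters, as Lovins requires
            if (word.endsWith(suf) && word.length() - suf.length() >= 2) {
                return word.substring(0, word.length() - suf.length());
            }
        }
        return word;
    }
}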

5.4 Porter Stemmer Algorithm

Step 1:

This step plays the most complicated role in the algorithm, with three parts in its main definition. The first part deals with plurals, for example 'sses' -> 'ss' and removal of a final 's'.

Step 2:

The second share removes the endings 'ED' and 'ING', applying 'EED' -> 'EE' where applicable. When 'ED' or 'ING' is removed, the remaining stem is adjusted so that particular suffixes can be recognized later.

Step 3:

The third share simply changes a terminal 'Y' to 'I'.

Step 4:

The remaining steps of the stemmer contain rules for the different order classes of suffixes, first mapping double suffixes to single ones and then removing suffixes when the relevant conditions are met.
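As a minimal sketch, the plural rules of the first share can be written as follows (the conditions used by the later steps are omitted):

// Illustrative plural rules from the first share of the Porter stemmer.
static String step1a(String w) {
    if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // "sses" -> "ss"
    if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // "ies"  -> "i"
    if (w.endsWith("ss"))   return w;                              // "ss" is kept
    if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // drop final "s"
    return w;
}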

5.5 Proposed Iterated Lovins Stemmer Algorithm

Step 1:

The Iterated Lovins stemmer is an extension of the Lovins stemmer, so it inherits the rules of the Lovins stemmer.

public class IteratedLovinsStemmer extends LovinsStemmer

Step 2:

Before the stemmer iterates over a given word, the word is converted to lower case.

/* Define a string str;
   if the length of str is greater than 1, convert the word
   to lower case before equalizing the stems;
   this runs inside a loop. */

Step 3:

In this algorithm, the Lovins stemming procedure is applied repeatedly (possibly twice) so that no residual suffix is left on the stemmed words.
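A minimal sketch of the iteration, assuming a LovinsStemmer class that exposes a stem(String) method; the loop generalizes the "possibly twice" reapplication described above:

public class IteratedLovinsStemmer extends LovinsStemmer {

    @Override
    public String stem(String word) {
        String str = word.toLowerCase();
        String stemmed = super.stem(str);
        // reapply the Lovins rules until the stem stops changing
        while (stemmed.length() > 1 && !stemmed.equals(str)) {
            str = stemmed;
            stemmed = super.stem(str);
        }
        return stemmed;
    }
}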

5.6 Experimental Setting and Data Description

For the experiments, Fig. 4 shows a sample input taken from 20 Newsgroups. 20 Newsgroups is a widely used data set for clustering and classification. It is a collection of approximately 20,000 documents drawn from 20 different Usenet newsgroups. Each newsgroup is gathered in a sub-directory, with each article stored as a separate document. The newsgroup data set was processed with the following software.

Fig. 4 Sample dataset from Newsgroup20

Software version: 1.0
Programming language: Java
Platform: tested only on GNU/Linux
Preferred IDE: NetBeans IDE 6.0.1
Apache Cloudera: 3
High-end server: PN 7382IA4, two-socket tower, Intel Xeon E5-2403 (quad core)
Cluster: Hadoop cluster with 15 nodes

Since Hadoop is built with Java, interoperability is simple. The following screenshots were generated by executing on the Big Data analytics cluster created in phase 1 of the project. Figure 4 shows a piece of input data taken from Newsgroup20. Figure 5 shows the given input file being added to HDFS, along with the execution of the stopwords program by creating a stopwords.jar file and the paths needed for compilation. Figure 6 shows the output screen of the stopwords program, with the stopword count for an input file taken from the Newsgroup20 dataset. Figure 7 shows the output of the proposed Iterated Lovins stemmer on the input of Fig. 4, which yields the most strongly reduced stem words. Figure 8 shows the Lovins stemmer output for the input of Fig. 4, and Fig. 9 the Porter stemmer output. Figure 10 shows the TF-IDF count values of the sample input data files.

Fig. 5 Adding the input file to HDFS and creating a stopwords jar file

Fig. 6 Output screen showing the count of stopwords in a given input file through the MapReduce phase

Fig. 7 Output screen for the proposed Iterated Lovins stemming algorithm

Fig. 8 Output screen for the Lovins stemming algorithm

Fig. 9 Output screen for the Porter stemming algorithm

Fig. 10 Output screen showing TF-IDF values of the sample input files

5.7 Screenshots from the Experimental Setting

6 Result Analysis

This section evaluates the performance of the different stemming algorithms: the Iterated Lovins stemming algorithm, the Lovins algorithm, and the Porter stemming algorithm. The analysis metrics considered are the Index Compression Factor (ICF), the Word Stemming Factor (WSF), and the Correct Stemming Word Factor (CSWF).

6.1 Index Compression Factor (ICF)

The Index Compression Factor expresses, as a percentage, the reduction from the total number of distinct words before stemming (N) to the number of distinct stems after stemming (S). The strength of the stemmer increases with the ICF value.

$$ ICF = \frac{\text{N} - \text{S}}{\text{S}} \times 100 $$

Figure 11 shows the performance metric Index Compression Factor (ICF) for three documents (doc 1, doc 2, doc 3), comparing the three stemming algorithms: the Iterated Lovins algorithm, the Lovins algorithm, and the Porter algorithm. The graph shows that the Iterated Lovins algorithm performs best of the three.

Fig. 11 Graph showing the ICF value result analysis of the stemmers

6.2 Word Stemming Factor (WSF)

The Word Stemming Factor is the percentage of words that have been stemmed by the stemming process (WS) out of the total number of words in the sample (TW). The strength of stemming increases with the number of words stemmed.

$$ WSF = \frac{\text{WS}}{\text{TW}} \times 100 $$

Figure 12 shows the performance metric Word Stemming Factor for the three documents, comparing the three stemming algorithms. The graph shows that the Iterated Lovins stemmer performs more efficiently than the others.

Fig. 12 Graph analysis of the WSF values of the stemmers

6.3 Correct Stemming Word Factor (CSWF)

The Correct Stemming Word Factor is the percentage of words that have been stemmed correctly (CSW) out of the number of words stemmed (WS). The accuracy of the stemmer increases with the CSWF percentage.

$$ CSWF = \frac{\text{CSW}}{\text{WS}} \times 100 $$
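The three factors translate directly into code; a minimal sketch using the counts defined above:

// Direct transcription of the three evaluation factors defined above.
public class StemmerMetrics {

    // N: distinct words before stemming, S: distinct stems after stemming
    static double icf(int n, int s)     { return 100.0 * (n - s) / s; }

    // WS: words stemmed, TW: total words in the sample
    static double wsf(int ws, int tw)   { return 100.0 * ws / tw; }

    // CSW: correctly stemmed words, WS: words stemmed
    static double cswf(int csw, int ws) { return 100.0 * csw / ws; }
}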

Figure 13 shows the performance metric Correct Stemming Word Factor, which measures how accurately stem words are produced, for the three documents across the three algorithms; the Iterated Lovins algorithm stems the most words correctly.

Fig. 13 Graph showing stemmed words versus distinct words in each document for the stemmers

The following Table 1 shows the result analysis of the different stemmers (Iterated Lovins stemmer, Lovins stemmer, and Porter stemmer) for three example documents (Doc1 as 'D 1', Doc2 as 'D 2', and Doc3 as 'D 3').

Table 1 Stemming factors defined for various stemmers considering 3 documents

Table 1 and the graphs above present the result analysis of the different stemmers (Iterated Lovins, Lovins, Porter). The table shows the factors considered in the stemmer analysis; these factors help select the stemming algorithm best suited to the project. The values are calculated with the formulas given above.

Result analysis with the stopword algorithm and stemming algorithm:

  • The size of the data is reduced.

  • Execution time is reduced because the file size is smaller.

  • The success rate is obtained very quickly.

  • Clustering can be applied easily.

  • Time consumption is lower.

  • System energy is preserved through the shorter running time.

7 Conclusion

In today's world, we are overloaded with enormous numbers of files and documents in every related field. As generations pass, historical and future trends drive innovations that leave behind enormous amounts of data. To process and analyze this huge data, "Big Data" approaches give relief where database management tools and traditional data processing applications fall short. A comparison was performed among the Iterated Lovins algorithm, the Lovins algorithm, and the Porter algorithm using the comparative factors ICF, WSF, and CSWF; the Iterated Lovins algorithm produced the most strongly reduced stem words.

Thus the new KNMF algorithm is used, and the application will be named 'Progressive Text Mining Radical'. The defined features of KNMF help cluster the documents, as we treat them as the ultimate cluster labels in K-means. Furthermore, a parallel MapReduce implementation for very large document collections minimizes the computation time.