
1 Introduction

In today's world, the amount of unstructured data is growing at such an enormous rate that existing relational systems are incapable of handling it. The data can take the form of audio-video clips, textual data, software program logs, flight records, etc. The information hidden inside these data opens up a completely new world of opportunity and insight. This is why every organization and individual wants to explore these huge amounts of data, and this demand forms the foundation of text mining. Text mining can be described as the practice of analyzing textual data to discover key concepts, themes, hidden trends, and relationships without prior knowledge of the exact terms the author has used to express a concept [1]. As part of text mining, algorithms from data mining, text analytics, machine learning, natural language processing, and statistics are used to extract high-quality, useful information from unstructured formats. Text mining, also popularly called “text analytics,” is a means by which unstructured data is processed for machine use. For example, if the Twitter comment “I don’t find the app useful: it’s really slow and constantly crashing.” is taken into consideration, then text mining of the contextual information is important to help us understand why the tone might be negative and what may be the cause of such customer disappointment, as shown in Fig. 1. Such an analysis may answer questions like “Is the person replying to another negative tweet, or is this the original composition? What is the application name? Is this the only problem with the app, or are there other problems too?”

Fig. 1

Text mining

1.1 Conventional Process Flow of Text Mining

Textual data, in the form of unstructured data, are normally available in readable document formats. These formats can be user comments, e-mails, corporate reports, web pages, news articles, etc. According to the conventional text mining process, the documents are first converted into a quantitative representation. Once the textual data is transformed into a set of numbers that precisely capture its hidden patterns, any data mining algorithm or statistical forecasting model can be applied to these numbers to generate insights or discover novel facts [2, 3].

A typical text mining process generally has the following sub-tasks.

Data Collection

Collection of textual data is the first step in any text mining research [3].

Text Parsing and Transformation

The next step is to parse the words from the documents. Sentences, parts of speech, and stemming words [3] are identified in the document. Document variables associated with the author, category, gender, etc., are also extracted along with the parsed words.

Text Filtering

After the words are parsed, there may be some irrelevant words which are not required in the analysis, and these words are removed from the document. This is done manually by browsing through the terms or words and is the most time-consuming and subjective task of all the text mining steps. A fair amount of subject and domain knowledge is required to perform it. In document filtering [3], the selected keywords are searched in all the selected documents; if a document does not contain any of the keywords, it is removed from the list used for analysis.

Text Transformation

In this step, the document is represented in a numerical matrix form [3]. This matrix generally contains the occurrences of the words, also called the term frequency. A numerical representation of the document is mandatory to perform any kind of analytics on it. Therefore, this step converts the unstructured text into a workable analytical document.

Text Mining

In this step, hidden patterns and knowledge are extracted using mining algorithms such as classification, clustering, association analysis, and regression analysis. As shown in Fig. 2, text mining is an iterative process in which the steps from filtering to mining are repeated based on the feedback received from this step [4].

Fig. 2

Conventional text mining process

1.2 Applications of Text Mining

The text mining process is used to answer industrial queries and to optimize daily operations efficiently. It is also used to support strategic business decisions in finance, automobile, marketing, health care, etc. Hidden patterns, trends, and perceptions are discovered from huge volumes of unstructured data using techniques like data analytics, categorization, and sentiment analysis. In this research, we discuss the following applications of text mining.

Risk Management

Inadequate risk estimation is one of the biggest reasons for failure in any industry. Text mining is used to estimate business risk properly and also to identify the most adequate way to mitigate it [3]. The application of text mining software has therefore drastically increased the risk mitigation capacity of industries.

Knowledge Management

Managing huge volumes of historical information creates many problems, such as huge storage space and latency in finding specific information. The healthcare industry is a classic example of these problems, where historical patient data can potentially be used for medical analysis and product development [3]. Text mining is therefore used to filter the useful information and discard the irrelevant. Analytic algorithms are then run on the filtered data so that only the extracted, previously unknown facts are stored, which reduces the storage and latency issues.

Cybercrime Prevention

The uncontrolled availability of information over the Internet opens the door to cybercrime. Text mining is used to trace cybercrime activities and also helps to identify the source of intruders [3]. It is therefore used by law enforcement and intelligence agencies.

Customer Care Service

Customer care services operate better with text mining and natural language processing. Text analytics software improves the customer experience by using valuable information sources such as surveys and customer call notes, which helps resolve customer problems effectively and quickly [3]. Text mining is also used for faster automated responses to customer queries.

Contextual Advertising

Digital advertising has reached a new level of safety and user privacy by applying text mining as the core engine of contextual retargeting [3]. It also provides better accuracy in contextual advertising.

Business Intelligence

Text mining is used to support faster decision making by taking valuable enterprise data into consideration [3]. It helps to find insights for improving the business by monitoring a huge number of data sources.

Social Media

Social media is a potential source of huge amounts of unstructured data, inside which many hidden patterns related to business, sentiment [5], and intelligence reside. Many organizations predict future customer needs using text analytics. This information helps organizations extract customer opinions, understand their emotions, and predict their requirements. Text mining has brought revolutionary changes to social media.

2 Literature Survey

As text mining is the focus of this research, some recent research artifacts have been studied. The related studies and analyses indicate that applying big data technologies like Hadoop MapReduce, k-means, particle swarm optimization (PSO), and cloud computing provides better results, reduced execution time, and better solutions for big data problems. Large data sets can be analyzed using a Hadoop cluster and parallelized clustering algorithms, and parallel k-means clustering provides a drastic reduction in execution time [1]. Document clustering, parallel k-means, and distributed computing [6] are the techniques that have been used with Hadoop MapReduce in that study. After selecting centroids randomly, every document is fed to one mapper. The mapper calculates the new centroids based on the Euclidean distance. The results of all mappers are sent to a reducer to calculate a resulting centroid, which is then compared with the assumed centroid [7]. If there is a difference in centroid value, the process is iterated; otherwise, the centroid is considered the final output, as shown in Figs. 3 and 4. A simplified sketch of the mapper-side assignment step is given below.
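
To make the assignment step concrete, the following minimal Java sketch shows how such a mapper could find the nearest centroid by Euclidean distance. The hard-coded two-dimensional centroids, the tab-separated input format, and the class name KMeansAssignMapper are illustrative assumptions, not the implementation of the cited work.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical assignment step of parallel k-means: each input line is a
 * document vector (tab-separated doubles). The mapper emits the id of the
 * nearest centroid together with the vector; a reducer would then average
 * all vectors of a cluster to obtain the new centroid.
 */
public class KMeansAssignMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In a real job the current centroids would be loaded in setup(),
  // e.g. from the distributed cache; here they are hard-coded for brevity.
  private final double[][] centroids = { {0.0, 0.0}, {1.0, 1.0} };

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    double[] vector = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Double.parseDouble(parts[i]);
    }

    // Find the nearest centroid by Euclidean distance
    // (input vectors are assumed to have the same dimensionality as the centroids).
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double sum = 0.0;
      int dims = Math.min(vector.length, centroids[c].length);
      for (int d = 0; d < dims; d++) {
        double diff = vector[d] - centroids[c][d];
        sum += diff * diff;
      }
      double dist = Math.sqrt(sum);
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    // Emit (clusterId, documentVector) for the reducer to average.
    context.write(new Text("cluster-" + best), value);
  }
}

A reducer would then average all vectors emitted for a cluster into a new centroid and compare it with the assumed one, repeating the job until the centroids stop changing.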

Fig. 3

Stages of document clustering using parallel k-means

Fig. 4

Parallel k-means algorithm with MapReduce

To settle the number of clusters and the initial centroids, the parallel k-means algorithm is modified and can be optimized using fuzzy logic, gravitational intelligence, and swarm optimization. Big data has its own challenges in storing data and retrieving it fast, and manual grouping of files is very complex when there is a huge number of documents. A new k-means non-negative matrix factorization (KNMF) approach with a modified guideline of non-negative matrix factorization [8, 9] is used for document clustering. A comparison of the iterated Lovins, Lovins, and Porter stemming algorithms shows that the maximum number of words is stemmed by the iterated Lovins algorithm. The characteristics of k-means non-negative matrix factorization therefore help in clustering documents with a parallel MapReduce implementation on large documents. This results in quick and easy clustering as well as less time consumption.

In order to shrink the computational time, HDFS, MapReduce, and clustering algorithms are used by distributing the clustering jobs on multiple nodes, which means multiple clustering tasks run in parallel on different nodes. A comparative review of the components of Hadoop and MapReduce has been carried out to compare traditional partition-based algorithms with their MapReduce implementations in achieving various clustering objectives on data sets of different sizes [7]. Introducing a combiner program between the map and reduce functions reduces the volume of data written by the mappers and read by the reducers, which decreases the overall operation time. The time reduction is much more noticeable when the number of documents is huge rather than on smaller data sets [1]. The model for implementing parallel k-means clustering in MapReduce without a combiner is shown in Fig. 4.

Along with the above methods, the global search ability of particle swarm optimization (PSO) is used for optimal generation of centroids. The power of parallel processing combined with global search supports data-intensive distributed applications with improved accuracy in generating compact clusters [4]. Some more works have been studied in the context of text mining, and a comparative study is presented in Table 1 in terms of their objectives, findings, and methods used. From the study, it is clear that MapReduce is the most popular technology for handling text mining problems. Therefore, in our research, we have implemented the proposed text mining model using Hadoop MapReduce with a partitioner.

Table 1 Comparative study of recent works in text mining

3 Proposed Decision Feedback-Based Text Mining Model

Textual content is typically available in comprehensive document formats. These formats can be e-mails, text files, user feedback, sentimental comments, corporate reports, news reports, Web pages, etc. The proposed text mining model first creates a quantitative representation of the document and then transforms the document into a set of numbers, where the numbers adequately capture the patterns of the textual data. Any traditional statistical model, forecasting model, or analytical algorithm can be used on these numbers to generate insights or to produce a predictive model. Statistical systems count the frequency of each word and calculate its statistical proximity to related conceptual indexes; such systems may produce inappropriate concepts and miss required words, which in turn reduces the accuracy of the prediction model. The iterative text mining decision feedback model is an advanced form of text mining in which the process is repeated until the result is acceptable, without leaving the process entirely. In this model, the feedback block controls the number of iterations; the feature selection, data analytics, and evaluation phases constitute the feedback block. This process also minimizes the interference of irrelevant words to increase model accuracy. The proposed iterative text mining model is designed as shown in Fig. 5. The steps involved in this model are:

Fig. 5

Proposed model of text mining

Data Collection

Collecting an unstructured data set for analysis is always the first step of any text mining process.

Text Parsing and Transformation

In this step, the data set is cleaned and a dictionary of words is created from the document using NLP. This includes identification of sentences, words, parts of speech, and stemming words [1]. Each word extracted from the document is associated with a variable for further reference in the process.

Text Filtering

In the parsed document, there will be some words which are not relevant to the mining process, and these words need to be filtered out of the document; this is referred to as word stopping and word stemming [14, 15]. The process requires in-depth domain knowledge. The number of words stemmed is denoted by “S”. The word stemming process is discussed in more detail in further sections.

Text Transformation

After text filtering, the document is represented by the occurrences of the words contained in it. After transformation, a document can be represented in two ways:

  • A simplified representation used in information retrieval and natural language processing, which contains the multiset of words irrespective of grammar, is known as a bag of words. It can be written as a JSON-like object, for example:

    Bow1 = {“John”: “3”, “is”: “1”, “Good”: “5”}

  • The vector space model is an algebraic representation of text involving two steps: first, the document is represented as a vector of words, and then the vector is transformed into a numerical format to which the techniques of text mining can be applied. In this research, the documents have been represented in a vector space model, as illustrated by the sketch below.
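
As an illustration of the vector space representation, the following minimal Java sketch turns a tokenized document into a term-frequency vector over a fixed vocabulary. The vocabulary, the sample tokens, and the class name are assumptions made for the example only.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: turning a tokenized document into a term-frequency vector. */
public class VectorSpaceExample {
  public static void main(String[] args) {
    // Fixed vocabulary defining the dimensions of the vector space (assumed).
    List<String> vocabulary = Arrays.asList("john", "is", "good", "bad");

    // Bag of words built from a tokenized document.
    List<String> tokens = Arrays.asList("john", "is", "good", "good", "john");
    Map<String, Integer> bagOfWords = new HashMap<>();
    for (String t : tokens) {
      bagOfWords.merge(t, 1, Integer::sum);
    }

    // Vector space model: one component per vocabulary term.
    int[] vector = new int[vocabulary.size()];
    for (int i = 0; i < vocabulary.size(); i++) {
      vector[i] = bagOfWords.getOrDefault(vocabulary.get(i), 0);
    }
    System.out.println(Arrays.toString(vector)); // prints [2, 1, 2, 0]
  }
}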

Feature Selection

It is also known as variable selection, in which a subset of the more important features is selected for model creation. Irrelevant and redundant features are excluded from model creation to improve model accuracy.

Data Mining

At this stage, the traditional data mining process is merged with text mining. Classical data mining techniques are used for clustering the data obtained from the quantitative representation of the document, to be used in the further evaluation steps. k-means clustering [9] or parallel k-means clustering [13] is the technique considered in this phase.

Evaluate

In this step, we evaluate the mining result. If the result is not acceptable, we discard it and continue the process iteratively to get the best result. Once the result is acceptable, we proceed to the next step.

The word stem factor (WSF) is calculated in this step to decide whether the result is acceptable. The word stem factor is defined as the percentage of the number of words stemmed (S) with respect to the total number of distinct words (U): WSF = (S/U) * 100.

Application

The evaluated model now has a broader area of application in different text mining processes. It is ready to be deployed as a product for real-life problems, such as web mining, e-consultation in medicine, Twitter data analysis, and resume filtering.

4 Big Data Technologies

In the hope of using data in the future, organizations collect and store enormous amounts of it. A number of significant global challenges have been noted as part of the revolution in big data technologies [16]. The way organizations are collecting, using, managing, and leveraging data using big data technologies goes far beyond imagination. In this research, we have focused on the most popular big data technology, Hadoop. It is one of the most sophisticated and ever-growing ecosystems in the era of big data. The different technologies of the Hadoop ecosystem are briefly discussed below.

4.1 Hadoop Distributed File System

The Hadoop distributed file system (HDFS) is used inside the Hadoop ecosystem to store huge amounts of data on a cluster of computers and to deliver them to the required applications at high bandwidth. A large cluster consists of hundreds or thousands of server nodes built from commodity hardware that execute user application tasks [16, 17]. Storage and computation are distributed across the servers, the system provides parallel processing with the required resources, and each node has the capability to grow with demand while the cost remains economical at every size. Data is stored in files, and files are placed on nodes with replication for fault tolerance. Some unique features of HDFS are highlighted below, followed by a short programmatic usage sketch.

  • Rack awareness: the physical location of a node is considered for storage allocation and task scheduling.

  • Minimal data motion: the processing is moved to the data rather than the data being moved to the processing, which reduces network bandwidth usage.

  • Previous versions of the stored data can be restored using the standby name node and the secondary name node in case of human or system errors.
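
As a simple illustration of how applications interact with HDFS, the sketch below uses the standard org.apache.hadoop.fs.FileSystem API to copy a local data set into HDFS and list the target directory. The paths are placeholders, and the snippet assumes the cluster configuration (core-site.xml) is available on the classpath; it is a usage sketch, not part of the proposed model.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: copying a local file into HDFS and listing a directory. */
public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS is normally picked up from core-site.xml on the cluster.
    FileSystem fs = FileSystem.get(conf);

    // Copy a local data set into HDFS (paths are placeholders).
    fs.copyFromLocalFile(new Path("/home/local/textdata"), new Path("/textmining"));

    // List the target directory to confirm the upload.
    for (FileStatus status : fs.listStatus(new Path("/textmining"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}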

4.2 MapReduce

As a parallel processing framework, Hadoop MapReduce is used for processing huge amounts of data in very little time. Large amounts of data are processed on clusters containing thousands of nodes built from commodity hardware, and the cluster is highly reliable and fault tolerant. A job tracker as the single master node and multiple task trackers acting as slave nodes constitute the initial architecture of the Hadoop framework, whereas yet another resource negotiator (YARN) is the advanced Hadoop architecture [10, 13]. The resource manager is responsible for scheduling jobs on the slave nodes, monitoring task execution, and re-executing failed tasks. Some more advantages of MapReduce are mentioned below.

  • Commodity hardware can be added to the existing servers to increase capacity, which is known as scale-out architecture or horizontal scaling.

  • Failed tasks are automatically recovered, proving the fault tolerance of the cluster.

  • Flexibility to work with a variety of file systems and serialization facilities from multiple open frameworks.

  • Intelligent data placing technique to maintain the load balancing with maximum utilization and efficiency.

A MapReduce process is shown in Fig. 6. The input splits are processed by the mappers in parallel. After the mapper phase is completed, the interim results are sorted and shuffled; they are then merged and fed to the reducers. The number of reducers defined in the MapReduce program determines the number of output files. A minimal word count job illustrating this flow is sketched below.
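
The classic word count job below is a minimal, self-contained illustration of this flow. It is not part of the proposed model; the two reducers are set only to show that the number of reducers determines the number of output part files.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Classic word count, used here only to illustrate the map-shuffle-reduce flow. */
public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // intermediate (word, 1) pairs
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Two reducers produce two output part files (part-r-00000, part-r-00001).
    job.setNumReduceTasks(2);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}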

Fig. 6

MapReduce processing

4.3 Pig

Pig is a data flow tool used for analyzing large data sets. It is not specific to Hadoop; it can be used with any parallel data processing framework. Although it supports all types of data, i.e., structured, semi-structured, quasi-structured, and unstructured, it is most frequently used for structured and semi-structured data. It uses the Pig Latin language [18]. Each Pig statement is converted into a logical plan and a series of MapReduce tasks, and a directed acyclic graph (DAG) is created for each job. Features:

  • Pig provides ease of programming, as developers have to write fewer lines of code than in MapReduce for a particular requirement.

  • If built-in functions are not available, users can create custom functions which can be easily integrated with Pig.

4.4 Hive

Apache Hive is data warehouse software inside the Hadoop ecosystem that helps to query, analyze, and manage large data sets stored in distributed storage (HDFS). It provides HiveQL, an SQL-like language for querying and retrieving data. All Hive queries are converted into MapReduce jobs by the Hive engine automatically and implicitly. When it is difficult to express logic in HiveQL, MapReduce programmers can plug custom mappers and reducers into Hive [17].

Hive allows indexing to accelerate data search; compaction and bitmap indexes are also available. It supports different file types such as plain text, RCFile, and ORC, and can operate on compressed data stored using GZIP, BZIP2, and Snappy. User-defined functions (UDFs) are supported by Hive when built-in functions are not available.

4.5 Sqoop

The facility to transfer data between HDFS and an RDBMS (e.g., MySQL or Oracle) is provided by Sqoop inside the Hadoop ecosystem. It imports data from the RDBMS into HDFS for processing and exports the data back to the RDBMS [18]. It facilitates connecting to different database servers and controlling the import and export processes, and it can import data into Hive and HBase.

4.6 Oozie

Apache Oozie is a Java-based Web application in the Hadoop ecosystem used for scheduling Hadoop jobs. It sequentially combines multiple jobs into one logical unit of work. It supports MapReduce jobs, Pig scripts, Hive queries, and Sqoop imports and exports; system-specific jobs such as Java programs or shell scripts can also be scheduled in Oozie. Oozie workflows and Oozie coordinators are the two categories of Oozie jobs, and multiple workflows and coordinators are bundled in Oozie to manage the lifecycle of running jobs. It is scalable and reliable.

4.7 Flume

Flume is used for the efficient collection, aggregation, and movement of large amounts of streaming data such as log records. It has failover and recovery mechanisms and is used for online analytic applications. Flume has a data set sink based on the Kite API that is used to write data to HDFS and HBase.

4.8 ZooKeeper

ZooKeeper is a centralized configuration and synchronization service in the Hadoop ecosystem [17]. Every time a service is scheduled, a lot of configuration needs to be changed and resources need to be synchronized, which makes the service fragile without such coordination. ZooKeeper is very fast for read-dominated workloads, with an ideal read-to-write ratio of about 10:1. It can be replicated over multiple servers to avoid a single point of failure.

5 Word Stemming

In the context of information retrieval and linguistic morphology, stemming is the process of reducing an inflected or derived word to its original stem. The stem is the base or morphological root form of a word, and stemming maps all related words to their stem [14]. Word stemming is an essential part of natural language processing and is done by removing any suffix or prefix attached to the stem word. This conversion is also required in text clustering, categorization, and summarization as part of pre-processing in text mining.

5.1 Pre-requisites for Stemming

Word stemming first requires tokenization and filtering of the document. These two processes bring the document to the granular level required for word stemming.

Tokenization

In tokenization, a document is split into a set of words based on a tokenizer or separator [12]. The separator can be a blank space or any special character. An example is illustrated below.

Text = “Science brings the society to the next level.”

The output of tokenization assuming blank space (“ ”) as a separator: [“Science”, “brings”, “the”, “society”, “to”, “the”, “next”, “level”].

Punctuation marks and non-text characters are removed from the document during tokenization, and the words are finally reduced to nouns, verbs, etc. Another approach to word tokenization focuses on the statistical distribution of the words inside the document instead of the occurrences of words. In statistical analysis, it is important to index the texts into vectors. In this research, the bag-of-words (BoW) approach has been adopted as part of the statistical representation of the document.

Filtering

This process removes words which are not important for the text mining process or which may degrade the analysis result; it is also called stop word filtering. Stop words [19] are words which are not required in the text mining process. The filtering is controlled according to the requirement, i.e., a strong stop word list will produce the best result in the text mining process. Stop word lists are available on the World Wide Web; one of the resources, available at http://www.lextek.com, has been used in this research [20]. A minimal filtering sketch is given below.
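
The following Java sketch loads a stop word list from a text file and removes those words from a token list. The stop word file name is a placeholder and the class is illustrative, not the exact implementation used in this research.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Minimal sketch of stop word filtering against a list loaded from a text file. */
public class StopWordFilter {

  private final Set<String> stopWords = new HashSet<>();

  /** Loads one stop word per line from the given file (path is a placeholder). */
  public StopWordFilter(String stopWordFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(stopWordFile))) {
      String line;
      while ((line = reader.readLine()) != null) {
        stopWords.add(line.trim().toLowerCase());
      }
    }
  }

  /** Returns only those tokens that are not in the stop word list. */
  public List<String> filter(List<String> tokens) {
    return tokens.stream()
        .filter(t -> !stopWords.contains(t.toLowerCase()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    StopWordFilter filter = new StopWordFilter("stopwords.txt"); // placeholder path
    List<String> tokens = Arrays.asList("Science", "brings", "the", "society",
        "to", "the", "next", "level");
    System.out.println(filter.filter(tokens)); // e.g. [Science, brings, society, next, level]
  }
}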

5.2 Classification of Stemming

Stemming algorithms are broadly classified into three groups, as shown in Fig. 7.

Fig. 7

Classification of stemming

Truncating Method

This method removes the prefix and suffix of a word. Truncate(n) is the most basic stemming algorithm, in which each word is truncated to its first n letters; words shorter than n letters are left as they are, and no stemming rule is applied to them. The chance of over-stemming is therefore increased. Another basic stemming algorithm converts plural words to their singular form by removing the suffix ‘s’ [14]. There are four types of algorithms in the truncating method, as highlighted below.

Lovins Stemmer

This algorithm contains 294 ending rules, 35 transformation rules, and 29 conditions. The longest suffix of a word is removed by this stemmer. After removing the suffix, the word is looked up in different tables and some adjustments are made to convert it into a valid stem or root word [15]. As a single-pass algorithm, it removes at most one suffix from a word. The algorithm can quickly reduce double-letter words such as “setting” to their stems, i.e., “set,” and it also handles many irregular plural forms, for instance “feet” to “foot” and “men” to “man.” However, the Lovins stemmer is time and data consuming, many suffixes are not covered by its ending rules, and it is sometimes unreliable because it cannot match stems of similar meaning.

Porter Stemmer

The Porter stemmer algorithm was proposed in 1980, and many modifications have since been suggested for and made to the basic algorithm. Its suffix rules are organized into five steps, and within each step the rules are tried until one of them is accepted. Once a rule is satisfied, the suffix is removed from the word, the resulting stem is returned, and the next step is performed [15]. The roughly 60 conditions of the algorithm take the form <conditional rule> <suffix> → <new suffix>. For example, if a word ends with “EED” and the preceding stem contains at least one consonant and vowel, the suffix is changed to “EE”; for instance, “Emceed” is changed to “Emcee,” but “Speed” remains as it is. The Porter stemmer is designed as a detailed stemming framework whose key intention is that programmers can develop new stemming rules for different sets of suffixes. A sketch of one such rule is shown below.
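
The sketch below illustrates a single rule of this form, “(m > 0) EED → EE,” in Java. The measure computation is simplified (the handling of ‘y’ is omitted), so this is an illustration of the rule style rather than a full Porter stemmer.

/**
 * Illustrative sketch of a single Porter-style rule of the form
 * <condition> <suffix> -> <new suffix>. This is NOT the full Porter stemmer;
 * the measure() below is a simplified version of Porter's "measure" condition.
 */
public class PorterRuleExample {

  /** Simplified Porter "measure": the number of vowel-to-consonant transitions. */
  static int measure(String stem) {
    int m = 0;
    boolean prevVowel = false;
    for (char c : stem.toCharArray()) {
      boolean vowel = "aeiou".indexOf(c) >= 0; // 'y' handling omitted for brevity
      if (prevVowel && !vowel) {
        m++; // end of a VC sequence
      }
      prevVowel = vowel;
    }
    return m;
  }

  /** Rule "(m > 0) EED -> EE": agreed -> agree, emceed -> emcee, speed -> speed. */
  static String applyEedRule(String word) {
    String w = word.toLowerCase();
    if (w.endsWith("eed")) {
      String stem = w.substring(0, w.length() - 3);
      if (measure(stem) > 0) {
        return stem + "ee";
      }
    }
    return w;
  }

  public static void main(String[] args) {
    System.out.println(applyEedRule("agreed")); // agree
    System.out.println(applyEedRule("speed"));  // speed
  }
}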

Paice/Husk Stemmer

It contains 120 rules indexed by suffix and is iterative in nature. In each iteration, the algorithm tries to find a match with the suffix and then either deletes or replaces it. An advantage of this algorithm is that it handles both deletion and replacement, but it is a very heavy algorithm and may cause over-stemming errors.

Dawson Stemmer

This is an extension of the Lovins stemmer which has an extensive list of about 1200 suffix transformations [15]. It is also a single-pass algorithm and is therefore fast. The suffixes are stored in reversed order, indexed by their length and last letter.

Statistical Method

These stemming algorithms remove the affixes (suffixes and prefixes) after applying a statistical analysis or technique. The N-gram, HMM, and YASS stemmers are statistical stemming algorithms. The N-gram stemmer is language independent and is based on n-grams and string comparison [14]. The HMM stemmer is an unsupervised, language-independent stemmer based on the hidden Markov model. The YASS stemmer is corpus based and can be implemented without knowledge of the morphology; it uses hierarchical clustering and a distance measure approach.

Mixed Method

These stemming algorithms are compositions of inflectional and derivational morphological methods, corpus-based methods, and context-sensitive methods [15]. In the inflectional methods, the algorithms deal with syntactic variations such as plurals, cases, and genders of a specific language. The Krovetz and Xerox stemmers are examples of inflectional and derivational methods. Corpus-based methods use the occurrences of word variants and address some drawbacks of the Porter stemmer, for example, “Iteration” is not reduced to “Iter” and “General” is not reduced to “Gener.”

6 Proposed Porter Stemmer with Partitioner Algorithm (PSP)

This algorithm has about 60 rules which can be coded using MapReduce. When the “partitioner” technique is applied together with all the Porter rules, it provides a better result. In a MapReduce partitioner, multiple partitions [21] are created for the data, based on conditions, before the data reach the reducers. The simplest partitioning technique is hash partitioning, but based on conditions we can create any required number of partitions. For example, if special characters are not required for the text mining process, we can separate them into one partition while the alphabetic and numeric tokens go to another. For this technique, the number of reducers needs to be set in the MapReduce program. Figure 8 shows the model of the proposed Porter stemmer with partitioner algorithm, which combines the rules of the Porter stemmer with a MapReduce partitioner. A minimal sketch of such a partitioner is given below.
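
The following Java sketch illustrates the idea of such a partitioner. The class name and the letters-only condition are assumptions used for illustration, not the exact implementation of the proposed model.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative custom partitioner in the spirit of the proposed PSP model:
 * tokens made only of letters go to partition 0 (where the Porter rules are
 * applied in the reducer), while numbers, punctuation, and special characters
 * are diverted to partition 1 and ignored by the analysis.
 */
public class StemPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // With a single reducer there is nothing to partition.
    if (numPartitions < 2) {
      return 0;
    }
    // Alphabetic tokens -> partition 0, everything else -> partition 1.
    return key.toString().matches("[a-zA-Z]+") ? 0 : 1;
  }
}

In the job driver, such a partitioner would be registered with job.setPartitionerClass(StemPartitioner.class) and the number of reducers fixed with job.setNumReduceTasks(2).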

Fig. 8

Proposed Porter stemmer algorithm with partitioner (PSP)

7 Hadoop Cluster Operation Modes

For this research, the selected documents are in an unstructured data format. Therefore, Hadoop MapReduce and HDFS have been chosen for the implementation. The selected documents are stored in HDFS, and a MapReduce program is run on each document in parallel [22]. For this purpose, a Hadoop cluster with Hadoop Architecture-2 has been set up. A Hadoop cluster can run in three different modes, as shown in Fig. 9.

Fig. 9

Hadoop cluster operation modes

Standalone Mode

Standalone mode is the default operation mode of a Hadoop cluster and is also known as local mode. In this mode, none of the daemons such as the name node, resource manager, secondary name node, data node, and node manager run inside the cluster. Therefore, it is mainly used for learning, debugging, and testing [23]. The cluster runs faster in this mode than in the two other modes. The HDFS storage architecture is not used, so storage behaves like an ordinary local file system such as NTFS or FAT32 on Windows. None of the configuration files such as mapred-site.xml, hdfs-site.xml, and core-site.xml are needed to run this mode, and all processes run in a single JVM.

Pseudo-Distributed Mode

In pseudo-distributed operation mode, all the daemons run on a single node. This mode is a simulation of a cluster, so all the processes run independently: the name node, resource manager, secondary name node, data node, and node manager run in separate Java virtual machines (JVMs) inside a single node. This mode mimics the operation of fully distributed mode on a single node [23].

The master-slave architecture of a Hadoop cluster also exists in this mode but is handled by a single system. The resource manager and name node run as the master, whereas the data node and node manager run as slaves. The secondary name node is used to take hourly backups of the name node. To run this mode, the configuration files (core-site.xml, mapred-site.xml, and hdfs-site.xml) need to be set up in the environment.

Fully Distributed Mode

This is the production mode of a Hadoop cluster, where multiple nodes are used. Some of the nodes run the master daemons, the resource manager and name node, whereas the rest of the nodes in the cluster run the slave daemons, the node manager and data node. The HDFS storage architecture is fully used here, so the files are stored on multiple nodes [23]. The configuration parameters of the cluster environment need to be specified in this mode. This mode is highly scalable, supporting both horizontal and vertical scaling, and it is completely reliable, fault tolerant, and has the full capability of distributed computing.

Standalone mode has a very limited scope, whereas fully distributed mode is highly expensive and needs a lot of configuration. Therefore, for this research, a pseudo-distributed cluster mode has been chosen: a Hortonworks pseudo-distributed cluster running on the Hadoop-2 architecture.

8 Environment Setup

A Hadoop cluster has been set up for the implementation using the Hortonworks Hadoop 2.2 distribution. It provides a command line interface to interact with the cluster and an easily accessible Web interface for displaying cluster-related information.

The commands used to bring up the Hadoop cluster are taken from [24]. Figure 10 shows the Hadoop version installed on the cluster.

Fig. 10

Installed Hadoop version

As the Hadoop architecture used here is a second-generation architecture, five daemons always run on the cluster to make it operational [25, 26]. The running daemons, shown in Fig. 11, are:

Fig. 11

Running daemons on the Hadoop cluster

  • Name node

  • Data node

  • Node manager

  • Resource manager

  • Job history server.

Information about the name node is shown in Figs. 12 and 13. The name node runs on port 8020. There are 38 blocks in total in the cluster. The cluster has 10.60 GB of storage for the Hadoop distributed file system out of a total of ~18 GB. Figure 14 shows the internal storage structure of HDFS. The server has a block size of 128 MB, and the files are stored as part files inside the blocks of HDFS; part files are the logical partitions of a bigger data set [24, 26]. The replication factor of the cluster is set to 1, so every file is present in a single rack only, according to the rack awareness of Hadoop.

Fig. 12

Name node information-1

Fig. 13

Name node information-2

Fig. 14

HDFS storage structure

9 Implementation

The implementation of this research follows all the steps of the proposed text mining model and compares the stemming performance of the Lovins stemmer, the Porter stemmer, and the proposed Porter stemmer with partitioner algorithm.

9.1 Data Collection

Three different data sets have been considered for the implementation of this research. The data sets are of different sizes and structures, as described below.

Data Set-1 (CV Data Set)

A CV structure has been considered as the first and smallest data set for this research. It contains text that is typically found in a CV, such as technologies, expertise, work experience, etc. This data set has been collected from an open source [27]; it is 2 KB in size and contains 260 words in total. As the data set contains free text, it is unstructured in nature. A portion of the data set is shown in Fig. 15.

Fig. 15

Data set-1

Data Set-2 (Speech data set)

Speeches have the most complex linguistic morphology. The second data set is the speech delivered by the Prime Minister of India on the 72nd Independence Day; the data has been collected from the official site of the PMO India [28]. The data set contains 8000 words in total and is unstructured in nature. A part of the data set is shown in Fig. 16.

Fig. 16

Data set-2

Data Set-3 (Twitter data set)

The third data set has been collected from the American microblogging site Twitter [29]. It contains social media comments and is the largest data set considered for this research, with a total of ~5,200,000 words and a size of 185 MB. A part of the data set is shown in Fig. 17.

Fig. 17

Data set-3

After the data sets are collected, they are transferred to HDFS, because data has to be present in HDFS for MapReduce processing. The commands used to move data from local storage to HDFS are given below.

  • To check if file exists in local storage—“ls”.

  • To move a file from local storage to HDFS—“hdfs dfs -copyFromLocal /home/local/textdata /textmining”.

Figure 18 shows the data sets present inside HDFS.

Fig. 18

Data sets in HDFS

9.2 Text Parsing

Text parsing is the technique of reading the input data set and breaking it down to the granular level of individual words; inside a MapReduce program, it is the logic that performs this task [11]. As the data in all the data sets are separated by spaces, we have used the line offset value and a string tokenizer [12] to parse the data sets. Parsing the data sets produces a bag of words (BoW); the pseudocode for text parsing is defined as TEXT-PARSING(A), and a standalone sketch of this logic is given at the end of this subsection.

For example, assume that “My name is xyz” is a line. The tokenizer converts the line into an array of words by splitting it on blank spaces; for this example, it produces [‘My’, ‘name’, ‘is’, ‘xyz’].

Text parsing is done on the mapper side, and all the further steps are done on the reducer side.
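
The standalone Java sketch below shows the kind of parsing logic assumed to run inside the mapper (TEXT-PARSING), tokenizing a line on blank spaces and lowercasing the tokens. The class and method names are illustrative, not the exact implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/**
 * Standalone sketch of the mapper-side parsing step: each input line is
 * broken into lowercase word tokens using StringTokenizer with blank
 * space as the separator.
 */
public class TextParsing {

  static List<String> parseLine(String line) {
    List<String> bagOfWords = new ArrayList<>();
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    while (tokenizer.hasMoreTokens()) {
      bagOfWords.add(tokenizer.nextToken().toLowerCase());
    }
    return bagOfWords;
  }

  public static void main(String[] args) {
    System.out.println(parseLine("My name is xyz")); // [my, name, is, xyz]
  }
}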

9.3 Text Filtering

Text filtering removes unwanted words from the bag of words. It is done by passing the tokenized array through a stop word filter. For example, the word “an” does not contribute any morphological interpretation to the analysis and therefore needs to be removed. When a word passes through the stop word filter, it is checked against the stop word list [12]; if the word is present in the list, it is removed from the bag. The pseudocode for text filtering is defined as TEXT-FILTERING(A) and explained below.

After stop word removal, three stemming algorithms, the Lovins stemmer, the Porter stemmer, and the proposed Porter stemmer with partitioner (PSP), have been used to implement the word stemming process. The proposed PSP stemmer also takes care of punctuation marks, special characters, etc., which are not relevant to our analysis. Therefore, a partitioner program has been used in the proposed PSP stemmer: it diverts all the special characters and punctuation into an unused partition and only considers the tokens containing alphabetic characters. After partitioning, the terms are passed through the stop word list. The number of words stemmed is denoted by “S”. The comparison of stemming results is discussed in further sections.

9.4 Text Transformation

After text filtering, the document is converted into a numerical matrix form called the document matrix [19]. The terms used in this representation are explained below.

Term Frequency (λ)

Term frequency is defined as the total occurrence of a stem word with respect to the total number of words present in a document, whereas the total occurrence of a root or stem word is defined as the term count.

Term count (T) = total count of occurrences of a stem word in the document.

Term frequency (λ) = total count of occurrences of the word in the document / total number of words in the document.

If N = total number of words in the document and T = term count, then term frequency (λ) = T/N. A small computational sketch follows.
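
The small Java sketch below computes the term count and the term frequency λ for each word of a document; the sample word list is illustrative only.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: term count and term frequency (lambda = T / N) for each word. */
public class TermFrequencyExample {
  public static void main(String[] args) {
    List<String> words = Arrays.asList("india", "free", "india", "nation", "free", "india");
    int n = words.size(); // N = total number of words in the document

    Map<String, Integer> termCount = new HashMap<>();
    for (String w : words) {
      termCount.merge(w, 1, Integer::sum); // T = term count of each word
    }

    for (Map.Entry<String, Integer> e : termCount.entrySet()) {
      double lambda = (double) e.getValue() / n; // term frequency
      System.out.printf("%s\t%d\t%.3f%n", e.getKey(), e.getValue(), lambda);
    }
  }
}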

Document Matrix

It is a numerical representation of the document. After finding the term frequency [12] of each unique term in the document, the document is presented in the form of a matrix [Word (Term), Term frequency (λ)], which constitutes the document matrix. Figure 19 shows a portion of the document matrix of data set-2.

Fig. 19

Document matrix of data set-“2”

Word Stem Factor (α)

The percentage of the total number of words stemmed with respect to the total number of unique words present in a document is defined as the word stem factor. Algorithms providing a higher percentage of stemming are known as dense stemmers.

S = total number of words stemmed

U = total number of unique words present in the document

α = (S/U) * 100

Stop Word Factor (β)

It is defined as the percentage of the total number of words stopped with respect to the total number of unique words present in a document.

X = Number of stopped words

β = (X/U) * 100

Cumulative Word Stem Factor (γ)

It is defined as the percentage of the total number of words stemmed and stopped with respect to the total number of unique words present in a document; a small computational sketch of these factors follows the formulas below.

γ = ((S + X)/U) * 100

and γ = α + β
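
Assuming the counters S, X, and U have been accumulated during the MapReduce run, the factors can be computed as in the small Java sketch below (the sample counts are illustrative only).

/**
 * Minimal sketch of the evaluation factors used in this work, assuming the
 * counters S (words stemmed), X (words stopped), and U (distinct words)
 * have already been accumulated during the MapReduce run.
 */
public class StemFactors {

  static double wordStemFactor(long s, long u) {
    return 100.0 * s / u; // alpha
  }

  static double stopWordFactor(long x, long u) {
    return 100.0 * x / u; // beta
  }

  static double cumulativeWordStemFactor(long s, long x, long u) {
    return 100.0 * (s + x) / u; // gamma = alpha + beta
  }

  public static void main(String[] args) {
    long s = 120, x = 40, u = 400; // illustrative counts only
    System.out.printf("alpha = %.2f%%, beta = %.2f%%, gamma = %.2f%%%n",
        wordStemFactor(s, u), stopWordFactor(x, u), cumulativeWordStemFactor(s, x, u));
  }
}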

9.5 Feature Selection

In this research, attributes such as “term,” “frequency,” “word stem factor,” and “stop word factor” have been considered.

9.6 Evaluate

The values obtained from the model have been accepted to complete the evaluation process. These values are analyzed in the following sections.

10 Result and Discussion

Table 2 shows a comparative study of the word stemming capacity of the three stemming algorithms. From the results, it is clear that the Porter stemmer with partitioner provides denser stemming than the Lovins stemmer and the Porter stemmer, and PSP is also more accurate in stop word filtering. This performance improvement applies to documents of all sizes. Figure 20 shows the graph of word stem factor for the different stemming algorithms. From the graph, it is observed that as the data set size increases, the PSP algorithm shows better results, which addresses the volume issue of big data [18], i.e., the model provides better results when operating on huge data sets. Similarly, a graph of stop word factor versus the stemming algorithms is shown in Fig. 21; it shows that the Porter stemmer, when operated with a partitioner, provides a better stopping capability than the Lovins stemmer and the Porter stemmer. Another graph, of cumulative word stem factor versus stemming algorithm, is shown in Fig. 22. The plot clearly points toward the better performance of the Porter stemmer with partitioner over the other stemming algorithms as the size of the data sets increases. The accuracy of the analysis depends on the stop word list and the words stemmed, so the stop word list is continuously updated for better results.

Table 2 Comparison of stemming results
Fig. 20

WSF versus stemmer

Fig. 21

SWF versus stemmer

Fig. 22

CWSF versus stemmer

11 Conclusion and Future Work

From the above result analysis, it is clear that the Porter stemmer algorithm with a Hadoop MapReduce partitioner provides better results than the Lovins stemmer and the traditional Porter stemmer. Therefore, the PSP algorithm can be used with big data to create an operational module for industrial applications, health care, social media, etc. The Porter stemmer with partitioner is capable of providing better results on huge data sets than the other stemming algorithms. The proposed methodology also has an extensible capability of removing unnecessary words from text mining and of reducing errors by following an iterative approach. The model can be used for CV filtration, online evaluation of subjective exam answers, sentiment analysis, etc. In the future, the model and algorithm will be implemented in other application domains such as health care, and the obtained results will be compared. Optimization techniques like particle swarm optimization (PSO) will also be applied to enhance the model.