
1 Introduction

In today's world, the amount of unstructured data is growing at such an enormous rate that existing relational systems are incapable of handling it. The data can take the form of audio-video clips, textual data, software program logs, flight records, etc. The information hidden inside these data opens up a completely new world of opportunity and insight. This is why every organization and individual wants to explore these huge amounts of data, and this demand forms the foundation of text mining. Text mining can be described as the practice of analyzing textual data to discover key concepts, themes, hidden trends, and relationships without prior knowledge of the exact terms the author has used to express a concept [1]. As part of text mining, algorithms from data mining, text analytics, machine learning, natural language processing, and statistics are used to extract high-quality, useful information from unstructured formats. Text mining, also popularly called “text analytics,” is a means by which unstructured data is processed for machine use. For example, if the Twitter comment “I don’t find the app useful: it’s really slow and constantly crashing.” is taken into consideration, then text mining of the contextual information is important to help us understand why the tone might be negative and what may be the cause of such customer disappointment, as shown in Fig. 1. Such an analysis may answer questions like “Is the person replying to another negative tweet, or is this the original composition? What is the application name? Is this the only problem with the app, or are there other problems too?”

Fig. 1

Text mining

1.1 Conventional Process Flow of Text Mining

Textual data, in the form of unstructured data, are normally available in readable document formats. These formats can be user comments, e-mails, corporate reports, web pages, news articles, etc. According to the conventional text mining process, the documents are first converted into a quantitative representation. Once the textual data is transformed into a set of numbers that precisely capture its hidden patterns, any data mining algorithm or statistical forecasting model can be applied to these numbers to generate insights or discover novel facts [2, 3].

A typical text mining process generally has the following sub-tasks.

Data Collection

Collection of textual data is the first step in any text mining research [3].

Text Parsing and Transformation

The next step is to parse the words from the documents. Sentences, parts of speech, and stemming words [3] are identified in the document. Document variables associated with the author, category, gender, etc., are also extracted along with the parsed words.

Text Filtering

After the words are parsed, there may be some irrelevant words which are not required in the analysis, and these words are removed from the document. This is done manually by browsing through the terms or words and is the most time-consuming and subjective task of all the text mining steps. A fair amount of subject and domain knowledge is required to perform it. In document filtering [3], the selected keywords are searched in all the selected documents; if a document does not contain any of the keywords, it is removed from the list used for analysis.

Text Transformation

In this step, the document is represented in a numerical matrix form [3]. This matrix generally contains the occurrences of the words, also called the term frequency. A numerical representation of the document is mandatory to perform any kind of analytics on it. Therefore, this step converts the unstructured text into a workable analytical document.

Text Mining

In this step, hidden patterns and knowledge are extracted using mining algorithms such as classification, clustering, association analysis, and regression analysis. As shown in Fig. 2, text mining is an iterative process in which the steps from filtering to mining are repeated based on the feedback received from this step [4].

Fig. 2

Conventional text mining process

1.2 Applications of Text Mining

The text mining process is used to answer industrial queries and to optimize daily operations efficiently. It is also used to support strategic business decisions in finance, automobile, marketing, health care, etc. Hidden patterns, trends, and perceptions are discovered from huge volumes of unstructured data using techniques like data analytics, categorization, and sentiment analysis. In this research, we discuss the following applications of text mining.

Risk Management

Inadequate risk estimation is one of the biggest reasons for failure in any industry. Text mining is used to estimate business risk properly and also to identify the most adequate way to mitigate it [3]. The application of text mining software has therefore drastically increased the risk mitigation capacity of industries.

Knowledge Management

Managing huge volumes of historical information creates many problems, such as huge storage space and latency in finding specific information. The healthcare industry is a classic example of these problems, where historical patient data can potentially be used for medical analysis and product development [3]. Text mining is therefore used to filter the useful information and discard the irrelevant. Analytic algorithms are then run on the filtered data so that only the extracted, previously unknown facts are stored, which reduces the storage and latency issues.

Cybercrime Prevention

The uncontrolled availability of information over the Internet opens the door to cybercrime. Text mining is used to trace cybercrime activities and also helps to identify the source of intruders [3]. It is therefore used by law enforcement and intelligence agencies.

Customer Care Service

Customer care services operate better with text mining and natural language processing. Text analytics software improves the customer experience by using valuable information sources such as surveys and customer call notes, which helps resolve customer problems effectively and quickly [3]. Text mining is also used for faster automated responses to customer queries.

Contextual Advertising

Digital advertising has reached a new level of safety and user privacy by applying text mining as the core engine of contextual retargeting [3]. It also provides better accuracy in contextual advertising.

Business Intelligence

Text mining is used to support faster decision making by taking valuable enterprise data into consideration [3]. It helps to find insights for improving the business by monitoring a huge number of data sources.

Social Media

Social media is a potential source of huge amounts of unstructured data, inside which many hidden patterns related to business, sentiment [5], and intelligence reside. Many organizations predict future customer needs using text analytics. This information helps organizations extract customer opinions, understand their emotions, and predict their requirements. Text mining has brought revolutionary changes to social media.

2 Literature Survey

As text mining is the focus of this research, some recent research artifacts have been studied. The related studies and analyses indicate that applying big data technologies like Hadoop MapReduce, k-means, particle swarm optimization (PSO), and cloud computing provides better results, reduced execution time, and better solutions for big data problems. Large data sets can be analyzed using a Hadoop cluster and parallelized clustering algorithms, and parallel k-means clustering provides a drastic reduction in execution time [1]. Document clustering, parallel k-means, and distributed computing [6] are the techniques that have been used with Hadoop MapReduce in that study. After selecting centroids randomly, every document is fed to one mapper. The mapper calculates the new centroids based on the Euclidean distance. The results of all mappers are sent to a reducer to calculate a resulting centroid, which is then compared with the assumed centroid [7]. If there is a difference in centroid value, the process is iterated; otherwise, the centroid is considered the final output, as shown in Figs. 3 and 4. A simplified sketch of the mapper-side assignment step is given below.
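
To make the assignment step concrete, the following minimal Java sketch shows how such a mapper could find the nearest centroid by Euclidean distance. The hard-coded two-dimensional centroids, the tab-separated input format, and the class name KMeansAssignMapper are illustrative assumptions, not the implementation of the cited work.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical assignment step of parallel k-means: each input line is a
 * document vector (tab-separated doubles). The mapper emits the id of the
 * nearest centroid together with the vector; a reducer would then average
 * all vectors of a cluster to obtain the new centroid.
 */
public class KMeansAssignMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In a real job the current centroids would be loaded in setup(),
  // e.g. from the distributed cache; here they are hard-coded for brevity.
  private final double[][] centroids = { {0.0, 0.0}, {1.0, 1.0} };

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    double[] vector = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vector[i] = Double.parseDouble(parts[i]);
    }

    // Find the nearest centroid by Euclidean distance
    // (input vectors are assumed to have the same dimensionality as the centroids).
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double sum = 0.0;
      int dims = Math.min(vector.length, centroids[c].length);
      for (int d = 0; d < dims; d++) {
        double diff = vector[d] - centroids[c][d];
        sum += diff * diff;
      }
      double dist = Math.sqrt(sum);
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    // Emit (clusterId, documentVector) for the reducer to average.
    context.write(new Text("cluster-" + best), value);
  }
}

A reducer would then average all vectors emitted for a cluster into a new centroid and compare it with the assumed one, repeating the job until the centroids stop changing.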

Fig. 3

Stages of document clustering using parallel k-means

Fig. 4

Parallel k-means algorithm with MapReduce

To settle the number of clusters and the initial centroids, the parallel k-means algorithm is modified and can be optimized using fuzzy logic, gravitational intelligence, and swarm optimization. Big data has its own challenges in storing data and retrieving it fast, and manual grouping of files is very complex when there is a huge number of documents. A new k-means non-negative matrix factorization (KNMF) approach with a modified guideline of non-negative matrix factorization [8, 9] is used for document clustering. A comparison of the iterated Lovins, Lovins, and Porter stemming algorithms shows that the maximum number of words is stemmed by the iterated Lovins algorithm. The characteristics of k-means non-negative matrix factorization therefore help in clustering documents with a parallel MapReduce implementation on large documents. This results in quick and easy clustering as well as less time consumption.

In order to shrink the computational time, HDFS, MapReduce, and clustering algorithms are used by distributing the clustering jobs on multiple nodes, which means multiple clustering tasks run in parallel on different nodes. A comparative review of the components of Hadoop and MapReduce has been carried out to compare traditional partition-based algorithms with their MapReduce implementations in achieving various clustering objectives on data sets of different sizes [7]. Introducing a combiner program between the map and reduce functions reduces the volume of data written by the mappers and read by the reducers, which decreases the overall operation time. The time reduction is much more noticeable when the number of documents is huge rather than on smaller data sets [1]. The model for implementing parallel k-means clustering in MapReduce without a combiner is shown in Fig. 4.

Along with the above methods, the global search ability of particle swarm optimization (PSO) is used for optimal generation of centroids. The power of parallel processing combined with global search supports data-intensive distributed applications with improved accuracy in generating compact clusters [4]. Some more works have been studied in the context of text mining, and a comparative study is presented in Table 1 in terms of their objectives, findings, and methods used. From the study, it is clear that MapReduce is the most popular technology for handling text mining problems. Therefore, in our research, we have implemented the proposed text mining model using Hadoop MapReduce with a partitioner.

Table 1 Comparative study of recent works in text mining

3 Proposed Decision Feedback-Based Text Mining Model

Textual content is typically available in comprehensive document formats. These formats can be e-mails, text files, user feedback, sentimental comments, corporate reports, news reports, Web pages, etc. The proposed text mining model first creates a quantitative representation of the document and then transforms the document into a set of numbers, where the numbers adequately capture the patterns of the textual data. Any traditional statistical model, forecasting model, or analytical algorithm can be used on these numbers to generate insights or to produce a predictive model. Statistical systems count the frequency of each word and calculate its statistical proximity to related conceptual indexes; such systems may produce inappropriate concepts and miss required words, which in turn reduces the accuracy of the prediction model. The iterative text mining decision feedback model is an advanced form of text mining in which the process is repeated until the result is acceptable, without leaving the process entirely. In this model, the feedback block controls the number of iterations; the feature selection, data analytics, and evaluation phases constitute the feedback block. This process also minimizes the interference of irrelevant words to increase model accuracy. The proposed iterative text mining model is designed as shown in Fig. 5. The steps involved in this model are:

Fig. 5

Proposed model of text mining

Data Collection

Collecting an unstructured data set for analysis is always the first step of any text mining process.

Text Parsing and Transformation

In this step, the data set is cleaned and a dictionary of words is created from the document using NLP. This includes identification of sentences, words, parts of speech, and stemming words [1]. Each word extracted from the document is associated with a variable for further reference in the process.

Text Filtering

In the parsed document, there will be some words which are not relevant to the mining process, and these words need to be filtered out of the document; this is referred to as word stopping and word stemming [14, 15]. The process requires in-depth domain knowledge. The number of words stemmed is denoted by “S”. The word stemming process is discussed in more detail in further sections.

Text Transformation

After text filtering, the document is represented by the occurrences of the words contained in it. After transformation, a document can be represented in two ways:

  • A simplified representation used in information retrieval and natural language processing, which contains the multiset of words irrespective of grammar, is known as a bag of words. It can be written as a JSON-like object, for example:

    Bow1 = {“John”: “3”, “is”: “1”, “Good”: “5”}

  • The vector space model is an algebraic representation of text involving two steps: first, the document is represented as a vector of words, and then the vector is transformed into a numerical format to which the techniques of text mining can be applied. In this research, the documents have been represented in a vector space model, as illustrated by the sketch below.
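
As an illustration of the vector space representation, the following minimal Java sketch turns a tokenized document into a term-frequency vector over a fixed vocabulary. The vocabulary, the sample tokens, and the class name are assumptions made for the example only.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: turning a tokenized document into a term-frequency vector. */
public class VectorSpaceExample {
  public static void main(String[] args) {
    // Fixed vocabulary defining the dimensions of the vector space (assumed).
    List<String> vocabulary = Arrays.asList("john", "is", "good", "bad");

    // Bag of words built from a tokenized document.
    List<String> tokens = Arrays.asList("john", "is", "good", "good", "john");
    Map<String, Integer> bagOfWords = new HashMap<>();
    for (String t : tokens) {
      bagOfWords.merge(t, 1, Integer::sum);
    }

    // Vector space model: one component per vocabulary term.
    int[] vector = new int[vocabulary.size()];
    for (int i = 0; i < vocabulary.size(); i++) {
      vector[i] = bagOfWords.getOrDefault(vocabulary.get(i), 0);
    }
    System.out.println(Arrays.toString(vector)); // prints [2, 1, 2, 0]
  }
}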

Feature Selection

It is also known as variable selection, in which a subset of the more important features is selected for model creation. Irrelevant and redundant features are excluded from model creation to improve model accuracy.

Data Mining

At this stage, the traditional data mining process is merged with text mining. Classical data mining techniques are used for clustering the data obtained from the quantitative representation of the document, to be used in the further evaluation steps. k-means clustering [9] or parallel k-means clustering [13] is the technique considered in this phase.

Evaluate

In this step, we evaluate the mining result. If the result is not acceptable, we discard it and continue the process iteratively to get the best result. Once the result is acceptable, we proceed to the next step.

The word stem factor (WSF) is calculated in this step to decide whether the result is acceptable. The word stem factor is defined as the percentage of the number of words stemmed (S) with respect to the total number of distinct words (U): WSF = (S/U) * 100.

Application

The evaluated model now has a broader area of application in different text mining processes. It is ready to be deployed as a product for real-life problems, such as web mining, e-consultation in medicine, Twitter data analysis, and resume filtering.

4 Big Data Technologies

In the hope of using data in the future, organizations collect and store enormous amounts of it. A number of significant global challenges have been noted as part of the revolution in big data technologies [16]. The way organizations are collecting, using, managing, and leveraging data using big data technologies goes far beyond imagination. In this research, we have focused on the most popular big data technology, Hadoop. It is one of the most sophisticated and ever-growing ecosystems in the era of big data. The different technologies of the Hadoop ecosystem are briefly discussed below.

4.1 Hadoop Distributed File System

The Hadoop distributed file system (HDFS) is used inside the Hadoop ecosystem to store huge amounts of data on a cluster of computers and to deliver them to the required applications at high bandwidth. A large cluster consists of hundreds or thousands of server nodes built from commodity hardware that execute user application tasks [16, 17]. Storage and computation are distributed across the servers, the system provides parallel processing with the required resources, and each node has the capability to grow with demand while the cost remains economical at every size. Data is stored in files, and files are placed on nodes with replication for fault tolerance. Some unique features of HDFS are highlighted below, followed by a short programmatic usage sketch.

  • Rack awareness: the physical location of a node is considered for storage allocation and task scheduling.

  • Minimal data motion: the processing is moved to the data rather than the data being moved to the processing, which reduces network bandwidth usage.

  • Previous versions of the stored data can be restored using the standby name node and the secondary name node in case of human or system errors.
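
As a simple illustration of how applications interact with HDFS, the sketch below uses the standard org.apache.hadoop.fs.FileSystem API to copy a local data set into HDFS and list the target directory. The paths are placeholders, and the snippet assumes the cluster configuration (core-site.xml) is available on the classpath; it is a usage sketch, not part of the proposed model.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: copying a local file into HDFS and listing a directory. */
public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS is normally picked up from core-site.xml on the cluster.
    FileSystem fs = FileSystem.get(conf);

    // Copy a local data set into HDFS (paths are placeholders).
    fs.copyFromLocalFile(new Path("/home/local/textdata"), new Path("/textmining"));

    // List the target directory to confirm the upload.
    for (FileStatus status : fs.listStatus(new Path("/textmining"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}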

4.2 MapReduce

As a parallel processing framework, Hadoop MapReduce is used for processing huge amounts of data in very little time. Large amounts of data are processed on clusters containing thousands of nodes built from commodity hardware, and the cluster is highly reliable and fault tolerant. A job tracker as the single master node and multiple task trackers acting as slave nodes constitute the initial architecture of the Hadoop framework, whereas yet another resource negotiator (YARN) is the advanced Hadoop architecture [10, 13]. The resource manager is responsible for scheduling jobs on the slave nodes, monitoring task execution, and re-executing failed tasks. Some more advantages of MapReduce are mentioned below.

  • Commodity hardware can be added to the existing servers to increase capacity, which is known as scale-out architecture or horizontal scaling.

  • Failed tasks are automatically recovered, proving the fault tolerance of the cluster.

  • Flexibility to work with a variety of file systems and serialization facilities from multiple open frameworks.

  • Intelligent data placing technique to maintain the load balancing with maximum utilization and efficiency.

A MapReduce process is shown in Fig. 6. The input splits are processed by the mappers in parallel. After the mapper phase is completed, the interim results are sorted and shuffled; they are then merged and fed to the reducers. The number of reducers defined in the MapReduce program determines the number of output files. A minimal word count job illustrating this flow is sketched below.
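
The classic word count job below is a minimal, self-contained illustration of this flow. It is not part of the proposed model; the two reducers are set only to show that the number of reducers determines the number of output part files.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Classic word count, used here only to illustrate the map-shuffle-reduce flow. */
public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // intermediate (word, 1) pairs
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Two reducers produce two output part files (part-r-00000, part-r-00001).
    job.setNumReduceTasks(2);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}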

Fig. 6

MapReduce processing

4.3 Pig

Pig is a data flow tool used for analyzing large data sets. It is not specific to Hadoop; it can be used with any parallel data processing framework. Although it supports all types of data, i.e., structured, semi-structured, quasi-structured, and unstructured, it is most frequently used for structured and semi-structured data. It uses the Pig Latin language [18]. Each Pig statement is converted into a logical plan and a series of MapReduce tasks, and a directed acyclic graph (DAG) is created for each job. Features:

  • Pig provides ease of programming, as developers have to write fewer lines of code than in MapReduce for a particular requirement.

  • If built-in functions are not available, users can create custom functions which can be easily integrated with Pig.

4.4 Hive

Apache Hive is data warehouse software inside the Hadoop ecosystem that helps to query, analyze, and manage large data sets stored in distributed storage (HDFS). It provides HiveQL, an SQL-like language for querying and retrieving data. All Hive queries are converted into MapReduce jobs by the Hive engine automatically and implicitly. When it is difficult to express logic in HiveQL, MapReduce programmers can plug custom mappers and reducers into Hive [17].

Hive allows indexing to accelerate data search; compaction and bitmap indexes are also available. It supports different file types such as plain text, RCFile, and ORC, and can operate on compressed data stored using GZIP, BZIP2, and Snappy. User-defined functions (UDFs) are supported by Hive when built-in functions are not available.

4.5 Sqoop

The facility to transfer data between HDFS and an RDBMS (e.g., MySQL or Oracle) is provided by Sqoop inside the Hadoop ecosystem. It imports data from the RDBMS into HDFS for processing and exports the data back to the RDBMS [18]. It facilitates connecting to different database servers and controlling the import and export processes, and it can import data into Hive and HBase.

4.6 Oozie

Apache Oozie is a Java-based Web application in the Hadoop ecosystem used for scheduling Hadoop jobs. It sequentially combines multiple jobs into one logical unit of work. It supports MapReduce jobs, Pig scripts, Hive queries, and Sqoop imports and exports; system-specific jobs such as Java programs or shell scripts can also be scheduled in Oozie. Oozie workflows and Oozie coordinators are the two categories of Oozie jobs, and multiple workflows and coordinators are bundled in Oozie to manage the lifecycle of running jobs. It is scalable and reliable.

4.7 Flume

Flume is used for the efficient collection, aggregation, and movement of large amounts of streaming data such as log records. It has failover and recovery mechanisms and is used for online analytic applications. Flume has a data set sink based on the Kite API that is used to write data to HDFS and HBase.

4.8 ZooKeeper

ZooKeeper is a centralized configuration and synchronization service in the Hadoop ecosystem [17]. Every time a service is scheduled, a lot of configuration needs to be changed and resources need to be synchronized, which makes the service fragile without such coordination. ZooKeeper is very fast for read-dominated workloads, with an ideal read-to-write ratio of about 10:1. It can be replicated over multiple servers to avoid a single point of failure.

5 Word Stemming

In the context of information retrieval and linguistic morphology, stemming is the process of reducing an inflected or derived word to its original stem. The stem is the base or morphological root form of a word, and stemming maps all related words to their stem [14]. Word stemming is an essential part of natural language processing and is done by removing any suffix or prefix attached to the stem word. This conversion is also required in text clustering, categorization, and summarization as part of pre-processing in text mining.

5.1 Pre-requisites for Stemming

Word stemming first requires tokenization and filtering of the document. These two processes bring the document to the granular level required for word stemming.

Tokenization

In tokenization, a document is split into a set of words based on a tokenizer or separator [12]. The separator can be a blank space or any special character. An example is illustrated below.

Text = “Science brings the society to the next level.”

The output of tokenization assuming blank space (“ ”) as a separator: [“Science”, “brings”, “the”, “society”, “to”, “the”, “next”, “level”].

Punctuation marks and non-text characters are removed from the document during tokenization, and the words are finally reduced to nouns, verbs, etc. Another approach to word tokenization focuses on the statistical distribution of the words inside the document instead of the occurrences of words. In statistical analysis, it is important to index the texts into vectors. In this research, the bag-of-words (BoW) approach has been adopted as part of the statistical representation of the document.

Filtering

This process removes words which are not important for the text mining process or which may degrade the analysis result; it is also called stop word filtering. Stop words [19] are words which are not required in the text mining process. The filtering is controlled according to the requirement, i.e., a strong stop word list will produce the best result in the text mining process. Stop word lists are available on the World Wide Web; one of the resources, available at http://www.lextek.com, has been used in this research [20]. A minimal filtering sketch is given below.
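
The following Java sketch loads a stop word list from a text file and removes those words from a token list. The stop word file name is a placeholder and the class is illustrative, not the exact implementation used in this research.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Minimal sketch of stop word filtering against a list loaded from a text file. */
public class StopWordFilter {

  private final Set<String> stopWords = new HashSet<>();

  /** Loads one stop word per line from the given file (path is a placeholder). */
  public StopWordFilter(String stopWordFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(stopWordFile))) {
      String line;
      while ((line = reader.readLine()) != null) {
        stopWords.add(line.trim().toLowerCase());
      }
    }
  }

  /** Returns only those tokens that are not in the stop word list. */
  public List<String> filter(List<String> tokens) {
    return tokens.stream()
        .filter(t -> !stopWords.contains(t.toLowerCase()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    StopWordFilter filter = new StopWordFilter("stopwords.txt"); // placeholder path
    List<String> tokens = Arrays.asList("Science", "brings", "the", "society",
        "to", "the", "next", "level");
    System.out.println(filter.filter(tokens)); // e.g. [Science, brings, society, next, level]
  }
}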

5.2 Classification of Stemming

Stemming algorithms are broadly classified into three groups, as shown in Fig. 7.

Fig. 7

Classification of stemming

Truncating Method

This method removes the prefix and suffix of a word. Truncate(n) is the most basic stemming algorithm, in which each word is truncated to its first n letters; words shorter than n letters are left as they are, and no stemming rule is applied to them. The chance of over-stemming is therefore increased. Another basic stemming algorithm converts plural words to their singular form by removing the suffix ‘s’ [14]. There are four types of algorithms in the truncating method, as highlighted below.

Lovins Stemmer

This algorithm contains 294 ending rules, 35 transformation rules, and 29 conditions. The longest suffix of a word is removed by this stemmer. After removing the suffix, the word is looked up in different tables and some adjustments are made to convert it into a valid stem or root word [15]. As a single-pass algorithm, it removes at most one suffix from a word. The algorithm can quickly reduce double-letter words such as “setting” to their stems, i.e., “set,” and it also handles many irregular plural forms, for instance “feet” to “foot” and “men” to “man.” However, the Lovins stemmer is time and data consuming, many suffixes are not covered by its ending rules, and it is sometimes unreliable because it cannot match stems of similar meaning.

Porter Stemmer

The Porter stemmer algorithm was proposed in 1980, and many modifications have since been suggested for and made to the basic algorithm. Its suffix rules are organized into five steps, and within each step the rules are tried until one of them is accepted. Once a rule is satisfied, the suffix is removed from the word, the resulting stem is returned, and the next step is performed [15]. The roughly 60 conditions of the algorithm take the form <conditional rule> <suffix> → <new suffix>. For example, if a word ends with “EED” and the preceding stem contains at least one consonant and vowel, the suffix is changed to “EE”; for instance, “Emceed” is changed to “Emcee,” but “Speed” remains as it is. The Porter stemmer is designed as a detailed stemming framework whose key intention is that programmers can develop new stemming rules for different sets of suffixes. A sketch of one such rule is shown below.
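
The sketch below illustrates a single rule of this form, “(m > 0) EED → EE,” in Java. The measure computation is simplified (the handling of ‘y’ is omitted), so this is an illustration of the rule style rather than a full Porter stemmer.

/**
 * Illustrative sketch of a single Porter-style rule of the form
 * <condition> <suffix> -> <new suffix>. This is NOT the full Porter stemmer;
 * the measure() below is a simplified version of Porter's "measure" condition.
 */
public class PorterRuleExample {

  /** Simplified Porter "measure": the number of vowel-to-consonant transitions. */
  static int measure(String stem) {
    int m = 0;
    boolean prevVowel = false;
    for (char c : stem.toCharArray()) {
      boolean vowel = "aeiou".indexOf(c) >= 0; // 'y' handling omitted for brevity
      if (prevVowel && !vowel) {
        m++; // end of a VC sequence
      }
      prevVowel = vowel;
    }
    return m;
  }

  /** Rule "(m > 0) EED -> EE": agreed -> agree, emceed -> emcee, speed -> speed. */
  static String applyEedRule(String word) {
    String w = word.toLowerCase();
    if (w.endsWith("eed")) {
      String stem = w.substring(0, w.length() - 3);
      if (measure(stem) > 0) {
        return stem + "ee";
      }
    }
    return w;
  }

  public static void main(String[] args) {
    System.out.println(applyEedRule("agreed")); // agree
    System.out.println(applyEedRule("speed"));  // speed
  }
}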

Paice/Husk Stemmer

It contains 120 rules indexed by suffix and is iterative in nature. In each iteration, the algorithm tries to find a match with the suffix and then either deletes or replaces it. An advantage of this algorithm is that it handles both deletion and replacement, but it is a very heavy algorithm and may cause over-stemming errors.

Dawson Stemmer

This is an extension of the Lovins stemmer which has an extensive list of about 1200 suffix transformations [15]. It is also a single-pass algorithm and is therefore fast. The suffixes are stored in reversed order, indexed by their length and last letter.

Statistical Method

These stemming algorithms remove the affixes (suffixes and prefixes) after applying a statistical analysis or technique. The N-gram, HMM, and YASS stemmers are statistical stemming algorithms. The N-gram stemmer is language independent and is based on n-grams and string comparison [14]. The HMM stemmer is an unsupervised, language-independent stemmer based on the hidden Markov model. The YASS stemmer is corpus based and can be implemented without knowledge of the morphology; it uses hierarchical clustering and a distance measure approach.

Mixed Method

These stemming algorithms are compositions of inflectional and derivational morphological methods, corpus-based methods, and context-sensitive methods [15]. In the inflectional methods, the algorithms deal with syntactic variations such as plurals, cases, and genders of a specific language. The Krovetz and Xerox stemmers are examples of inflectional and derivational methods. Corpus-based methods use the occurrences of word variants and address some drawbacks of the Porter stemmer, for example, “Iteration” is not reduced to “Iter” and “General” is not reduced to “Gener.”

6 Proposed Porter Stemmer with Partitioner Algorithm (PSP)

This algorithm has about 60 rules which can be coded using MapReduce. When the “partitioner” technique is applied together with all the Porter rules, it provides a better result. In a MapReduce partitioner, multiple partitions [21] are created for the data, based on conditions, before the data reach the reducers. The simplest partitioning technique is hash partitioning, but based on conditions we can create any required number of partitions. For example, if special characters are not required for the text mining process, we can separate them into one partition while the alphabetic and numeric tokens go to another. For this technique, the number of reducers needs to be set in the MapReduce program. Figure 8 shows the model of the proposed Porter stemmer with partitioner algorithm, which combines the rules of the Porter stemmer with a MapReduce partitioner. A minimal sketch of such a partitioner is given below.
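
The following Java sketch illustrates the idea of such a partitioner. The class name and the letters-only condition are assumptions used for illustration, not the exact implementation of the proposed model.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative custom partitioner in the spirit of the proposed PSP model:
 * tokens made only of letters go to partition 0 (where the Porter rules are
 * applied in the reducer), while numbers, punctuation, and special characters
 * are diverted to partition 1 and ignored by the analysis.
 */
public class StemPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // With a single reducer there is nothing to partition.
    if (numPartitions < 2) {
      return 0;
    }
    // Alphabetic tokens -> partition 0, everything else -> partition 1.
    return key.toString().matches("[a-zA-Z]+") ? 0 : 1;
  }
}

In the job driver, such a partitioner would be registered with job.setPartitionerClass(StemPartitioner.class) and the number of reducers fixed with job.setNumReduceTasks(2).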

Fig. 8

Proposed Porter stemmer algorithm with partitioner (PSP)

7 Hadoop Cluster Operation Modes

For this research, the selected documents are in an unstructured data format. Therefore, Hadoop MapReduce and HDFS have been chosen for the implementation. The selected documents are stored in HDFS, and a MapReduce program is run on each document in parallel [22]. For this purpose, a Hadoop cluster with Hadoop Architecture-2 has been set up. A Hadoop cluster can run in three different modes, as shown in Fig. 9.

Fig. 9

Hadoop cluster operation modes

Standalone Mode

Standalone mode is the default operation mode of a Hadoop cluster and is also known as local mode. In this mode, none of the daemons such as the name node, resource manager, secondary name node, data node, and node manager run inside the cluster. Therefore, it is mainly used for learning, debugging, and testing [23]. The cluster runs faster in this mode than in the two other modes. The HDFS storage architecture is not used, so storage behaves like an ordinary local file system such as NTFS or FAT32 on Windows. None of the configuration files such as mapred-site.xml, hdfs-site.xml, and core-site.xml are needed to run this mode, and all processes run in a single JVM.

Pseudo-Distributed Mode

In pseudo-distributed operation mode, all the daemons run on a single node. This mode is a simulation of a cluster, so all the processes run independently: the name node, resource manager, secondary name node, data node, and node manager run in separate Java virtual machines (JVMs) inside a single node. This mode mimics the operation of fully distributed mode on a single node [23].

The master-slave architecture of a Hadoop cluster also exists in this mode but is handled by a single system. The resource manager and name node run as the master, whereas the data node and node manager run as slaves. The secondary name node is used to take hourly backups of the name node. To run this mode, the configuration files (core-site.xml, mapred-site.xml, and hdfs-site.xml) need to be set up in the environment.

Fully Distributed Mode

This is the production mode of a Hadoop cluster, where multiple nodes are used. Some of the nodes run the master daemons, the resource manager and name node, whereas the rest of the nodes in the cluster run the slave daemons, the node manager and data node. The HDFS storage architecture is fully used here, so the files are stored on multiple nodes [23]. The configuration parameters of the cluster environment need to be specified in this mode. This mode is highly scalable, supporting both horizontal and vertical scaling, and it is completely reliable, fault tolerant, and has the full capability of distributed computing.

Standalone mode has a very limited scope, whereas fully distributed mode is highly expensive and needs a lot of configuration. Therefore, for this research, a pseudo-distributed cluster mode has been chosen: a Hortonworks pseudo-distributed cluster running on the Hadoop-2 architecture.

8 Environment Setup

A Hadoop cluster has been set up for the implementation using the Hortonworks Hadoop 2.2 distribution. It provides a command line interface to interact with the cluster and an easily accessible Web interface for displaying cluster-related information.

The commands used to bring up the Hadoop cluster are taken from [24]. Figure 10 shows the Hadoop version installed on the cluster.

Fig. 10

Installed Hadoop version

As the Hadoop architecture used here is a second-generation architecture, five daemons always run on the cluster to make it operational [25, 26]. The running daemons, shown in Fig. 11, are:

Fig. 11

Running daemons on the Hadoop cluster

  • Name node

  • Data node

  • Node manager

  • Resource manager

  • Job history server.

Information about the name node is shown in Figs. 12 and 13. The name node runs on port 8020. There are 38 blocks in total in the cluster. The cluster has 10.60 GB of storage for the Hadoop distributed file system out of a total of ~18 GB. Figure 14 shows the internal storage structure of HDFS. The server has a block size of 128 MB, and the files are stored as part files inside the blocks of HDFS; part files are the logical partitions of a bigger data set [24, 26]. The replication factor of the cluster is set to 1, so every file is present in a single rack only, according to the rack awareness of Hadoop.

Fig. 12

Name node information-1

Fig. 13

Name node information-2

Fig. 14

HDFS storage structure

9 Implementation

The implementation of this research follows all the steps of the proposed text mining model and compares the stemming performance of the Lovins stemmer, the Porter stemmer, and the proposed Porter stemmer with partitioner algorithm.

9.1 Data Collection

Three different data sets have been considered for the implementation of this research. The data sets are of different sizes and structures, as described below.

Data Set-1 (CV Data Set)

A CV structure has been considered as the first and smallest data set for this research. It contains text that is typically found in a CV, such as technologies, expertise, work experience, etc. This data set has been collected from an open source [27]; it is 2 KB in size and contains 260 words in total. As the data set contains free text, it is unstructured in nature. A portion of the data set is shown in Fig. 15.

Fig. 15

Data set-1

Data Set-2 (Speech data set)

Speeches have the most complex linguistic morphology. The second data set is the speech delivered by the Prime Minister of India on the 72nd Independence Day; the data has been collected from the official site of the PMO India [28]. The data set contains 8000 words in total and is unstructured in nature. A part of the data set is shown in Fig. 16.

Fig. 16

Data set-2

Data Set-3 (Twitter data set)

The third data set has been collected from the American microblogging site Twitter [29]. It contains social media comments and is the largest data set considered for this research, with a total of ~5,200,000 words and a size of 185 MB. A part of the data set is shown in Fig. 17.

Fig. 17

Data set-3

After the data sets are collected, they are transferred to HDFS, because data has to be present in HDFS for MapReduce processing. The commands used to move data from local storage to HDFS are given below.

  • To check if file exists in local storage—“ls”.

  • To move a file from local storage to HDFS—“hdfs dfs -copyFromLocal /home/local/textdata /textmining”.

Figure 18 shows the data sets present inside HDFS.

Fig. 18

Data sets in HDFS

9.2 Text Parsing

Text parsing is the technique of reading the input data set and breaking it down to the granular level of individual words; inside a MapReduce program, it is the logic that performs this task [11]. As the data in all the data sets are separated by spaces, we have used the line offset value and a string tokenizer [12] to parse the data sets. Parsing the data sets produces a bag of words (BoW); the pseudocode for text parsing is defined as TEXT-PARSING(A), and a standalone sketch of this logic is given at the end of this subsection.

For example, assume that “My name is xyz” is a line. The tokenizer converts the line into an array of words by splitting it on blank spaces; for this example, it produces [‘My’, ‘name’, ‘is’, ‘xyz’].

Text parsing is done on the mapper side, and all the further steps are done on the reducer side.
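
The standalone Java sketch below shows the kind of parsing logic assumed to run inside the mapper (TEXT-PARSING), tokenizing a line on blank spaces and lowercasing the tokens. The class and method names are illustrative, not the exact implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/**
 * Standalone sketch of the mapper-side parsing step: each input line is
 * broken into lowercase word tokens using StringTokenizer with blank
 * space as the separator.
 */
public class TextParsing {

  static List<String> parseLine(String line) {
    List<String> bagOfWords = new ArrayList<>();
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    while (tokenizer.hasMoreTokens()) {
      bagOfWords.add(tokenizer.nextToken().toLowerCase());
    }
    return bagOfWords;
  }

  public static void main(String[] args) {
    System.out.println(parseLine("My name is xyz")); // [my, name, is, xyz]
  }
}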

9.3 Text Filtering

Text filtering removes unwanted words from the bag of words. It is done by passing the tokenized array through a stop word filter. For example, the word “an” does not contribute any morphological interpretation to the analysis and therefore needs to be removed. When a word passes through the stop word filter, it is checked against the stop word list [12]; if the word is present in the list, it is removed from the bag. The pseudocode for text filtering is defined as TEXT-FILTERING(A) and explained below.

After stop word removal, three stemming algorithms, the Lovins stemmer, the Porter stemmer, and the proposed Porter stemmer with partitioner (PSP), have been used to implement the word stemming process. The proposed PSP stemmer also takes care of punctuation marks, special characters, etc., which are not relevant to our analysis. Therefore, a partitioner program has been used in the proposed PSP stemmer: it diverts all the special characters and punctuation into an unused partition and only considers the tokens containing alphabetic characters. After partitioning, the terms are passed through the stop word list. The number of words stemmed is denoted by “S”. The comparison of stemming results is discussed in further sections.

9.4 Text Transformation

After text filtering, the document is converted into a numerical matrix form called the document matrix [19]. The terms used in this representation are explained below.

Term Frequency (λ)

Term frequency is defined as the total occurrence of a stem word with respect to the total number of words present in a document, whereas the total occurrence of a root or stem word is defined as the term count.

Term count (T) = total count of occurrences of a stem word in the document.

Term frequency (λ) = total count of occurrences of the word in the document / total number of words in the document.

If N = total number of words in the document and T = term count, then term frequency (λ) = T/N. A small computational sketch follows.
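
The small Java sketch below computes the term count and the term frequency λ for each word of a document; the sample word list is illustrative only.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: term count and term frequency (lambda = T / N) for each word. */
public class TermFrequencyExample {
  public static void main(String[] args) {
    List<String> words = Arrays.asList("india", "free", "india", "nation", "free", "india");
    int n = words.size(); // N = total number of words in the document

    Map<String, Integer> termCount = new HashMap<>();
    for (String w : words) {
      termCount.merge(w, 1, Integer::sum); // T = term count of each word
    }

    for (Map.Entry<String, Integer> e : termCount.entrySet()) {
      double lambda = (double) e.getValue() / n; // term frequency
      System.out.printf("%s\t%d\t%.3f%n", e.getKey(), e.getValue(), lambda);
    }
  }
}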

Document Matrix

It is a numerical representation of the document. After finding the term frequency [12] of each unique term in the document, the document is presented in the form of a matrix [Word (Term), Term frequency (λ)], which constitutes the document matrix. Figure 19 shows a portion of the document matrix of data set-2.

Fig. 19

Document matrix of data set-“2”

Word Stem Factor (α)

The percentage of the total number of words stemmed with respect to the total number of unique words present in a document is defined as the word stem factor. Algorithms providing a higher percentage of stemming are known as dense stemmers.

S = total number of words stemmed

U = total number of unique words present in the document

α = (S/U) * 100

Stop Word Factor (β)

It is defined as the percentage of the total number of words stopped with respect to the total number of unique words present in a document.

X = Number of stopped words

β = (X/U) * 100

Cumulative Word Stem Factor (γ)

It is defined as the percentage of the total number of words stemmed and stopped with respect to the total number of unique words present in a document; a small computational sketch of these factors follows the formulas below.

γ = ((S + X)/U) * 100

and γ = α + β
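
Assuming the counters S, X, and U have been accumulated during the MapReduce run, the factors can be computed as in the small Java sketch below (the sample counts are illustrative only).

/**
 * Minimal sketch of the evaluation factors used in this work, assuming the
 * counters S (words stemmed), X (words stopped), and U (distinct words)
 * have already been accumulated during the MapReduce run.
 */
public class StemFactors {

  static double wordStemFactor(long s, long u) {
    return 100.0 * s / u; // alpha
  }

  static double stopWordFactor(long x, long u) {
    return 100.0 * x / u; // beta
  }

  static double cumulativeWordStemFactor(long s, long x, long u) {
    return 100.0 * (s + x) / u; // gamma = alpha + beta
  }

  public static void main(String[] args) {
    long s = 120, x = 40, u = 400; // illustrative counts only
    System.out.printf("alpha = %.2f%%, beta = %.2f%%, gamma = %.2f%%%n",
        wordStemFactor(s, u), stopWordFactor(x, u), cumulativeWordStemFactor(s, x, u));
  }
}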

9.5 Feature Selection

In this research, attributes such as “term,” “frequency,” “word stem factor,” and “stop word factor” have been considered.

9.6 Evaluate

The values obtained from the model have been accepted to complete the evaluation process. These values are analyzed in the following sections.

10 Result and Discussion

Table 2 shows a comparative study of the word stemming capacity of the three stemming algorithms. From the results, it is clear that the Porter stemmer with partitioner provides denser stemming than the Lovins stemmer and the Porter stemmer, and PSP is also more accurate in stop word filtering. This performance improvement applies to documents of all sizes. Figure 20 shows the graph of word stem factor for the different stemming algorithms. From the graph, it is observed that as the data set size increases, the PSP algorithm shows better results, which addresses the volume issue of big data [18], i.e., the model provides better results when operating on huge data sets. Similarly, a graph of stop word factor versus the stemming algorithms is shown in Fig. 21; it shows that the Porter stemmer, when operated with a partitioner, provides a better stopping capability than the Lovins stemmer and the Porter stemmer. Another graph, of cumulative word stem factor versus stemming algorithm, is shown in Fig. 22. The plot clearly points toward the better performance of the Porter stemmer with partitioner over the other stemming algorithms as the size of the data sets increases. The accuracy of the analysis depends on the stop word list and the words stemmed, so the stop word list is continuously updated for better results.

Table 2 Comparison of stemming results
Fig. 20

WSF versus stemmer

Fig. 21

SWF versus stemmer

Fig. 22

CWSF versus stemmer

11 Conclusion and Future Work

From the above result analysis, it is clear that the Porter stemmer algorithm with a Hadoop MapReduce partitioner provides better results than the Lovins stemmer and the traditional Porter stemmer. Therefore, the PSP algorithm can be used with big data to create an operational module for industrial applications, health care, social media, etc. The Porter stemmer with partitioner is capable of providing better results on huge data sets than the other stemming algorithms. The proposed methodology also has an extensible capability of removing unnecessary words from text mining and of reducing errors by following an iterative approach. The model can be used for CV filtration, online evaluation of subjective exam answers, sentiment analysis, etc. In the future, the model and algorithm will be implemented in other application domains such as health care, and the obtained results will be compared. Optimization techniques like particle swarm optimization (PSO) will also be applied to enhance the model.