1 Introduction

Text mining can help an organization acquire potentially valuable business insight. Text mining from unstructured data can be a challenging task, as natural language processing, statistical modelling and machine learning techniques are often required due to the inconsistent syntax and semantics of natural language text, including slang and language specific to vertical industries [1].

Industrial companies operate a large number of production databases in which huge amounts of data are stored. A large share of company information (approximately 80%) is available in textual formats, which is why text mining is useful and important in industrial management [9].

The automotive industry is more data-driven today than at any time in its history. In-car sensors, GPS tracking, automated manufacturing processes, and more are producing vast volumes of data that need to be analysed and understood. RapidMiner’s Predictive Analytics platform enables car makers to derive value from this data by extracting the information hidden online, inside the vehicle, or at the plant with the purpose of better understanding product usage, preferences, and manufacturing processes to ensure quality and customer satisfaction [10].

Our research is based on production data stored in text documents, especially forms. The goal of our research is to examine and further analyse production data from the automotive industry using RapidMiner, an open-source tool well suited to text mining.

2 Text Mining and Solution Proposal

2.1 Knowledge Discovery and Methodology

Knowledge discovery in process automation means extracting knowledge from different types of data; the variants include KDD (Knowledge Discovery in Databases), KDT (Knowledge Discovery in Text) and TM (Text Mining). Knowledge Discovery in Text is also called Text Mining. Due to the unstructured nature of natural language, the process of knowledge discovery in a set of text documents is considerably more complex than knowledge discovery in databases [2].

A Text Mining Analysis of Academic Libraries’ Tweets, a respected publication in the field, deals with clarifying how text mining operates in practice. The authors apply a text mining approach to a large dataset of tweets posted by academic libraries, extracting the most frequent one-word, two-word and three-word expressions. These findings highlight the importance of using data and text mining approaches to understand the aggregate social data of academic libraries, aiding decision-making and strategic planning for the marketing of services [3].

Seo Wonchul and colleagues researched how to extract product information from a patent database using text mining techniques and then generate product connection rules represented as directed pairs of products. Finally, the authors evaluated the potential value of product opportunities, taking into account the firm’s internal capabilities [4]. Their approach can facilitate product-oriented research and development by presenting a front-end model for new product development and deriving feasible product opportunities according to the target firm’s internal capabilities [4].

Knowledge discovery through text mining is a general process which in our research consists of the steps shown in Fig. 1.

Fig. 1. The steps of knowledge discovery using text mining

  1. Obtaining data and transformation into the required format: The first step in this process is to create an input text file containing a list of attributes and values. The obtained unstructured data must be transformed into a structured CSV file format. The transformed data can then be processed more efficiently and effectively.

  2. Pre-processing: The CSV file is used as the input data source for RapidMiner. The source data is loaded into RapidMiner, and functions for editing the text and eliminating inconsistent or erroneous data are applied. The values are then retyped using the “Process Documents from Data” operator, where text-editing methods such as tokenization, filtering and stop-word removal are used. Data modified in this way are ready to be processed with analytical techniques.

  3. Applying a clustering (descriptive analytics) technique: The output of the “Process Documents from Data” operator consists of:

     (a) a word list,

     (b) a document vector.

The word list is not needed for clustering; however, the document vector is necessary. The output of “Process Documents from Data” is the pre-processed data, interpreted as a partial result for the company and subsequently used as a base data set for further analysis [6].
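Outside RapidMiner, the relationship between the two outputs (word list and document vector) can be illustrated with a short sketch. The following Python snippet is not part of the RapidMiner process and uses hypothetical toy documents; it builds both outputs with the same TF-IDF weighting idea:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build a word list and TF-IDF document vectors, mirroring the two
    outputs of RapidMiner's "Process Documents from Data" operator."""
    tokenized = [doc.lower().split() for doc in documents]
    word_list = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # document frequency: number of documents containing each word
    df = Counter(w for doc in tokenized for w in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * math.log(n_docs / df[w])
                        for w in word_list])
    return word_list, vectors

# hypothetical breakdown descriptions
words, vecs = tfidf_vectors(["robot arm failure", "welding robot stop"])
```

A word occurring in every document (here "robot") receives a TF-IDF weight of zero, which is why the document vector, not the raw word list, is the input relevant for clustering.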

The CRISP-DM methodology was developed by a consortium initially composed of Daimler-Chrysler, SPSS and NCR. CRISP-DM stands for CRoss-Industry Standard Process for Data Mining [13, 14]. It consists of a cycle comprising six stages, captured in Fig. 2.

Fig. 2. Phases of CRISP-DM [10]

This initial phase, Business understanding, focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

The Data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

The subsequent Data preparation phase covers all activities needed to construct the final dataset from the initial raw data.

In the Modelling phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values.

During the Evaluation stage, the obtained model (or models) is more thoroughly evaluated and the steps executed to construct the model are reviewed to be certain it properly achieves the business objectives.

Creation of the model is generally not the end of the project. The Deployment phase focuses on organising and presenting the knowledge and results in a way that the customer can use further.

The sequence of these six stages is not rigid. All stages are duly organised, structured and defined, so that a project can easily be understood or revised [13].

Our proposed knowledge discovery platform is based on the CRISP-DM methodology; each phase of our approach corresponds to a phase of this methodology. This paper presents only partial results from the process.

2.2 Get Data from Form to CSV Format

The forms acquired from the car body shop contain data relating to breakdowns which occurred in this area. With regard to data availability, we work only with major breakdowns, i.e. those lasting more than 30 min. All attributes and values describing the breakdowns are stored in application forms. These forms are not suitable for analysis and are therefore transformed using a program for extracting information from reports about major breakdowns in manufacturing.

Information on major breakdowns is stored in a similar structure for each breakdown; one report corresponds to one XLSX file. The reports are stored hierarchically in folders grouped by the date of the breakdown.

For further processing, it is necessary to convert these reports to CSV format. The program for extracting this information is written in C# using Visual Studio 2015 as a Windows Forms application. The program sequence is divided into the following steps:

  • Select the topmost folder in the hierarchy.

  • Select the file name and path where the extracted data are stored.

  • Find all relevant XLSX reports of major breakdown.

  • Extract requested information from those reports.

  • Get the information (parsing values in given cell range).

  • Clean/transform the information.

  • Save extracted information to a CSV file.

Since XLSX reports on major faults are stored in the OpenXml format, the OpenXml library was used to extract data from these reports [5]. In order to process a large number of reports, extraction and transformation of the data runs in parallel; four breakdown reports are processed at a time.
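The overall flow of the extractor (recursive folder search, per-report extraction, parallel processing of four reports, CSV output) can be sketched as follows. This is an illustrative Python outline, not the original C# program; `extract_report` is a hypothetical placeholder for the cell-range parsing performed with the OpenXml library, and the field names are assumed:

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_report(path):
    """Hypothetical stand-in for parsing one XLSX breakdown report.
    The real program reads a fixed cell range with the OpenXml library."""
    return {"file": Path(path).name, "duration_min": 45}

def extract_all(top_folder, out_stream):
    # find all relevant XLSX reports anywhere below the topmost folder
    files = sorted(Path(top_folder).rglob("*.xlsx"))
    writer = csv.DictWriter(out_stream, fieldnames=["file", "duration_min"])
    writer.writeheader()
    # process four reports at a time, as in the original program
    with ThreadPoolExecutor(max_workers=4) as pool:
        for row in pool.map(extract_report, files):
            writer.writerow(row)
```

Writing the rows from a single thread while only the parsing runs in the pool keeps the CSV output ordered and free of interleaving.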

2.3 RapidMiner

RapidMiner is currently one of the most widely used open-source predictive analytics platforms for data analysis. It is available as a stand-alone application for data investigation and as a data mining engine for integration into other products. RapidMiner provides an integrated environment for data mining and machine learning procedures, including [11]:

  • extracting data from different source systems, transforming it and loading it into a data warehouse (DW) or other data repository,

  • data pre-processing and visualization,

  • predictive analytics and statistical modelling, evaluation, and deployment.

What makes it even more powerful is that it provides learning schemes, models and algorithms from WEKA and R scripts [11].

RapidMiner provides a graphical user interface (GUI) for designing and executing analytical workflows. These workflows form a process, which consists of multiple operators connected with one another in the process view. Each operator carries out one task within the process, and its output forms the input of the next operator in the workflow. The main function of a process is the analysis of the data retrieved at its beginning [11].

RapidMiner offers a large number of different operators, and the set can easily be extended with existing extensions. There are packages for text processing, web mining, WEKA extensions, R scripting, series analysis, Python scripting, anomaly detection and more [7].

2.4 Text Mining Using RapidMiner

The RapidMiner Text Processing Extension adds all operators necessary for statistical text analysis. Text can be loaded from different data sources and transformed by various filtering techniques for analysis. The extension supports several text formats, including plain text, HTML and PDF. It also provides standard filters for tokenization, stemming, stop-word filtering and n-gram generation. The Text Processing package can be installed and updated through the Marketplace entry in the Help menu. The extension uses a special class for handling documents, called the document class, which stores the whole document together with additional meta-information [8].

3 Design Process for Analysis

3.1 The Proposal of Text Mining Process

Figure 3 shows the proposed process, which contains several operators and a subprocess called “Process Documents from Data”. At the beginning of the process there is the “Read CSV” operator, which is used to read CSV files. The CSV files store text data in plain-text form, with all values of a record on one line. Values of different attributes are separated by the column separator “,”. Our CSV file contains 64 attributes and 358 records.

Fig. 3. The proposal of the text mining process in RapidMiner

For our research, one of these attributes, called “ResponsibleEmp”, is important, as it contains information about the employee responsible for each breakdown. The records in CSV format capture information about serious production problems requiring repairs longer than 30 min.

The “Filter Examples” operator removes the header rows that remain in the data after joining multiple CSV files, as well as several erroneous entries with a time of less than 30 min. The “Replace Missing Values” operator replaces missing values of the selected attribute “ResponsibleEmp” with the value “missing”. The next operator, “Nominal to Text”, changes the type of the selected nominal attribute to text and maps all values of the attribute “ResponsibleEmp” to the corresponding string values. After these steps, the data is ready for text processing.
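The effect of these three cleaning operators can be sketched in a few lines of Python (illustrative only; the attribute name "DurationMin" is an assumed stand-in for the real duration column in our data):

```python
def clean_records(records, min_duration=30):
    """Mimic "Filter Examples", "Replace Missing Values" and
    "Nominal to Text" on a list of CSV rows loaded as dicts."""
    cleaned = []
    for rec in records:
        # drop repeated header rows left over from joining CSV files
        if rec.get("ResponsibleEmp") == "ResponsibleEmp":
            continue
        # drop erroneous entries below the 30-minute threshold
        if float(rec.get("DurationMin") or 0) < min_duration:
            continue
        # replace a missing value with the literal string "missing"
        if not rec.get("ResponsibleEmp"):
            rec["ResponsibleEmp"] = "missing"
        # "Nominal to Text": ensure the value is a plain string
        rec["ResponsibleEmp"] = str(rec["ResponsibleEmp"])
        cleaned.append(rec)
    return cleaned
```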

The subprocess of the operator “Process Documents from Data” can be seen in Fig. 4. This operator generates word vectors from string attributes, using the TF-IDF scheme to create the corresponding vectors in RapidMiner.

Fig. 4. Process Documents from Data

3.2 Tokenize

The first step in our text mining process uses the “Tokenize” operator. This operator splits the text of a document into a sequence of tokens. The simplest token is a character, although the simplest meaningful (to a human) token is a word. There are several options for specifying the splitting points (non-letters, specified characters, regular expression, linguistic sentences and linguistic tokens). The default setting is non-letters, which results in tokens consisting of single words and is the most appropriate option before finally building the word vector [11]. In our paper, we do not split whole sentences, because our data contains only names and surnames. Some records include non-standard characters such as “/” and “−”; therefore, we use the tokenizer in non-letters mode.
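The effect of non-letters tokenization can be reproduced with a single regular expression. A small Python sketch (the sample value is hypothetical):

```python
import re

def tokenize_non_letters(text):
    """Split text at every non-letter character: each maximal run of
    letters (including accented Slovak/Czech letters) becomes a token."""
    return re.findall(r"[^\W\d_]+", text)

tokens = tokenize_non_letters("Novák/Kováč - Ing. Malý")
# the separators "/", "-", "." and spaces are dropped
```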

Figure 5, below, shows the effect of tokenizing on a single attribute called “ResponsibleEmp”. Records with two or more values are split, punctuation characters are removed and the tokens are separated. Some values, mainly surnames, are replaced by “???”, because this type of data is sensitive for the company and must be anonymised.

Fig. 5. Effect of tokenizing on a single attribute

3.3 Filter Stop Words

The next step in our pre-processing uses the “Filter Stopwords (Czech)” operator. This operator filters Czech stop words from a document by removing every token that equals a stop word in the built-in stop-word list. It filters common words such as prepositions, conjunctions and adverbs, reducing the number of unnecessary words and helping to improve system performance [12]. Our records contain only names and surnames and none of these unnecessary words, so this operator is not strictly necessary in our work. Nevertheless, the step is included in our proposed process, because stop-word filtering is the second most commonly used operation in text processing (after tokenization) and can be very useful for future analyses.
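A minimal equivalent of the stop-word filter can be written in Python; the set below is a tiny illustrative subset of a Czech stop-word list, not the built-in list used by the operator:

```python
# a tiny illustrative subset of the Czech stop-word list
CZECH_STOPWORDS = {"a", "ale", "nebo", "je", "na", "se", "v"}

def filter_stopwords(tokens, stopwords=CZECH_STOPWORDS):
    """Remove every token that equals a stop word (case-insensitively)."""
    return [t for t in tokens if t.lower() not in stopwords]
```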

3.4 Filter Tokens (by Length)

The length of the tokens can also be taken into account. This operator filters tokens based on their length (i.e. the number of characters they contain). Some records (name and surname) contain academic degrees (in Slovak, e.g. Ing., Bc., Mgr.) and initials of names, and this unnecessary information should be removed. The parameter “min chars” defines the minimal number of characters a token must contain to be kept, and “max chars” the maximal number allowed. Slovak academic degrees are usually no longer than three characters; therefore, the bounds were set to 4–25. Some examples can be seen in Fig. 6.
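With the bounds used in our process (4–25 characters), the length filter is a one-liner in Python; degree abbreviations and initials fall out:

```python
def filter_tokens_by_length(tokens, min_chars=4, max_chars=25):
    """Keep only tokens whose character count lies within the bounds;
    Slovak degree abbreviations such as "Ing", "Bc", "Mgr" are dropped."""
    return [t for t in tokens if min_chars <= len(t) <= max_chars]
```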

Fig. 6. The source and result of using the Filter Tokens (by Length) operator

3.5 Transform Case

In the document-processing subprocess, we add the “Transform Cases” operator, which transforms all characters in a document to either lower case or upper case. This step is necessary to avoid mismatches between identical words that differ only in case. For example, “GOLADA” is transformed to “golada” and “Ertrese” is converted to “ertrese”. Transforming to lower case is very useful for the subsequent text mining steps.

3.6 Replace Tokens

Another operator used is “Replace Tokens”, which allows substrings within each token to be replaced. The user can specify arbitrary pattern-replacement pairs in the replace_dictionary parameter: the left column of the table specifies what should be replaced and the right column the replacement.

In our case, the Slovak language includes diacritical marks, which should be removed and replaced by plain ASCII characters for easier use in the next step of document processing: the character “á” is replaced by “a”, “š” by “s”, etc.
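In RapidMiner this is done with an explicit replacement dictionary; an equivalent sketch in Python uses Unicode decomposition instead of listing every character pair:

```python
import unicodedata

def strip_diacritics(token):
    """Replace accented characters with their base letters,
    e.g. "á" -> "a", "š" -> "s" (NFD decomposition, then
    removal of the combining marks)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```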

3.7 Tokenize (2)

The last operator applied is another “Tokenize”, which is used to separate names and surnames. The parameter “mode” is set to “regular expression”. A regular expression can define specific character classes, e.g. [a-z] for any lowercase letter or .* for a sequence of arbitrary characters. Our regular expression removes names that do not carry significant information. An example of the regular expression can be seen in Fig. 7.
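The filtering effect of a regular-expression tokenizer can be imitated as follows; the pattern shown here is a hypothetical example, since the actual expression in Fig. 7 is specific to the company's data:

```python
import re

def keep_matching_tokens(tokens, pattern=r"[a-z]{4,}"):
    """Keep only tokens that fully match the pattern; here: lowercase
    words of at least four letters (an illustrative pattern only)."""
    return [t for t in tokens if re.fullmatch(pattern, t)]
```

At this stage of the process the tokens are already lowercased and stripped of diacritics, so a purely ASCII lowercase pattern is sufficient.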

Fig. 7. Regular expression parameter

The “Clustering” operator is included only in preparation for further research using advanced text mining techniques, and the “WordList to Data” operator transforms the word list produced by “Process Documents from Data” into a regular data table.

4 Evaluation and Results

The result of this research, shown in Fig. 8, is a list of responsible employees and the number of breakdowns each is responsible for.

Fig. 8. List of responsible employees and the count of breakdowns

We analysed data from a one-year period. On the basis of the results, the company can design a motivation system and a method of addressing the problems. Using the acquired results, the company can analyse the problem with the worker responsible for the largest number of breakdowns and propose measures to improve the functioning of the production process.

The processed data provide a partial result for the company, and they also serve as a base data set for further research using advanced data mining techniques.