
1 Introduction

With the advent of the new industrial revolution, industrial control systems have gradually moved from closed to open environments, and their security situation has become increasingly severe. Protecting industrial data is an urgent problem in the development of the industrial Internet. Data classification is the precondition and core foundation of industrial Internet data security because it informs the strategies used in data-protection decision-making. However, compared with traditional data, the volume of industrial Internet data is extremely large, which makes classification difficult. Furthermore, the data mostly comprise real-time monitoring data and therefore impose stricter requirements on time efficiency. Relevant industry enterprises are accelerating the formulation of classification standards and classification technology, and the automatic classification of industrial Internet data has become an emerging trend.

Applying machine-learning classification algorithms to industrial Internet data is an experimental approach to the automatic classification of industrial data. However, industrial data require higher time efficiency than traditional data, so a model that considers both accuracy and time is necessary to ensure adequate performance. Current single classifiers do not achieve both. To solve this problem, this research proposes the RFT classification model and evaluates it on fault data of fan tooth belts. The research contributions are as follows.

  • An automatic classification method for industrial data is proposed on the basis of random tree. Compared with the general random forest model, attribute sampling is applied to improve classification time efficiency. Experiments show that the proposed model can save more time than the random forest.

  • The model proposed in this paper is based on ensemble learning. Experiments show that this model has higher classification accuracy than single models such as naive Bayes and random tree.

The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 establishes the classification model proposed in this research. Section 4 conducts an experimental analysis of the proposed model. Section 5 summarizes the work of this research.

2 Related Works

Many studies at home and abroad focus on data protection and data classification, but few specialize in industrial data classification. To address this problem, Xu et al. [1] proposed a classification method for remotely sensed data sources using a two-branch convolutional neural network (CNN), and Dörksen et al. [2] presented ComRef-2D-ConvHull, a method based on ComRef for linear classification optimization in lower-dimensional feature space. However, these algorithms focus on accuracy, and their time efficiency is poor. Platos et al. [3] described the processing of two different datasets acquired from a steel-mill factory using three methods, namely, SVM, fuzzy rules, and Bayesian classification; however, their accuracy is insufficient. Among other typical classification models, Chutia et al. [4] developed an effective method based on principal component analysis (PCA) to improve the classification accuracy of random forest in both predictive ability and computational expense. However, PCA assumes that the variables follow a Gaussian distribution; when they do not (e.g., a uniform distribution), scaling and rotation distortions occur, so the method is unsuitable for time-series data of the industrial Internet. Tong et al. [5] proposed a novel privacy-preserving naive Bayes learning scheme with multiple data sources, which enables a trainer to train a naive Bayes classifier over a dataset provided jointly by different data owners without the help of a trusted curator; however, its time performance is poor. Augereau et al. [6] proposed a document image classification method combining textual features extracted with the bag-of-words technique and visual features extracted with the bag-of-visual-words technique; however, data generated by Internet of Things devices are often structured and numerical, so the method is unsuitable for non-text industrial data.

In summary, current classification methods mostly focus on text and image data, and little research specializes in industrial data classification. Furthermore, current methods focus heavily on accuracy and cannot meet the time-efficiency requirements of industrial data. Moreover, classification accuracy itself still requires improvement.

3 Classification Model

A single classifier has limited classification accuracy, whereas a combination of the naive Bayes algorithm and a decision tree consumes considerable time. Decreasing the time consumption of the combined model is the core idea of the model built in this research. On this basis, a random forest model based on random tree (RFT) is proposed. Random forest is an ensemble classifier that integrates many decision trees into a forest and uses them jointly to predict the final result; the ensemble compensates for the inherent shortcomings of any single model and avoids its limitations. Specifically, it first generates j training sets using the bootstrap method, which resamples many new sets of the same size from the original samples. A decision tree is constructed for each training set. When a node searches for a feature to split on, some features are randomly extracted from the feature set, and the optimal split is found among the extracted features. Following the bagging idea, random forest thus samples both instances and features, which helps avoid over-fitting; as a combination-based algorithm, its accuracy is usually higher than that of other machine-learning algorithms. Random tree randomly selects several attributes to construct a tree, so its classification time is relatively low. This research applies the random-tree concept to random forest to improve its efficiency: first, several training samples are generated; then, some attributes are randomly selected; finally, features are extracted from those attributes to make decisions. In other words, samples, attributes, and features are all sampled to construct the classifiers in our model. The principle of the model is as follows:
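The three-level sampling described above (instances, attributes, then features, followed by a majority vote) can be sketched as a minimal, self-contained Python example. This is not the authors' implementation: a nearest-centroid learner stands in for a real random tree, and all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rft_fit(X, y, n_trees=10, attr_frac=0.5, feat_frac=0.5):
    """Toy RFT-style ensemble: each member sees a bootstrap sample of the
    rows, a random subset of the attributes, and a further random subset of
    features drawn from those attributes; it then stores class centroids."""
    n, d = X.shape
    members = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                        # bootstrap sample
        attrs = rng.choice(d, size=max(1, int(d * attr_frac)), replace=False)
        feats = rng.choice(attrs, size=max(1, int(len(attrs) * feat_frac)),
                           replace=False)
        Xi, yi = X[rows][:, feats], y[rows]
        centroids = {c: Xi[yi == c].mean(axis=0) for c in np.unique(yi)}
        members.append((feats, centroids))
    return members

def rft_predict(members, x):
    """Each member votes with its nearest class centroid; the ensemble
    answer is the majority vote over all member votes."""
    votes = []
    for feats, centroids in members:
        xi = x[feats]
        votes.append(min(centroids, key=lambda c: np.linalg.norm(xi - centroids[c])))
    return max(set(votes), key=votes.count)
```

Because every member trains on a different row/attribute/feature slice, the members disagree in different ways, and the vote smooths out individual errors while each member stays cheap to build.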

Assume that the training sample has \( l \) records; then, the total sample can be marked as \( G = \{ g_{1}, g_{2}, \ldots, g_{l} \} \). After sampling, \( j \) trees are generated, where \( j \in \{ 1,2,3,\ldots,d\} \) and \( d > 0 \). The sample attribute set \( X \subseteq R^{n} \) is a set of n-dimensional vectors, the attribute probability set \( P = \{ 0,1\} \) is a set of two-dimensional vectors, and \( E \subseteq R^{s} \) is a set of s-dimensional eigenvectors. The output \( Y = \{ c_{1}, c_{2}, c_{3}, \ldots, c_{k} \} \) is a set of class tags. The input attribute vector is \( x \in X \), the attribute probability vector is \( p \in P \), the feature vector is \( e \in E \), and the output vector is \( y \in Y \).

Then, the RFT is constructed as follows:

For the \( i \)th (\( i \le j \)) subsample in the random forest, \( T = \{ g_{1}, g_{3}, \ldots, g_{2h - 1} \} \), \( h < \frac{l + 1}{2} \), assume that attribute selection uses even multiple interpolation sampling. Then, the set of input attributes is

$$ X = \{ p_{0} \cdot x_{1},\, p_{1} \cdot x_{2},\, p_{0} \cdot x_{3}, \ldots, p_{1} \cdot x_{n} \}, \quad p_{0} = 0,\; p_{1} = 1,\; n = 2m,\; m > 0 $$
(1)
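Concretely, Eq. (1) zeroes out every attribute paired with \( p_{0} = 0 \), so only the even-indexed attributes survive the sampling. A small numpy sketch (with an illustrative eight-attribute vector) makes this explicit:

```python
import numpy as np

x = np.arange(1.0, 9.0)            # attribute vector x_1..x_8, so n = 2m = 8
p = np.tile([0, 1], len(x) // 2)   # alternating probabilities: p_0 = 0, p_1 = 1
masked = p * x                     # as in Eq. (1): p_0-attributes are zeroed out
selected = x[p == 1]               # the surviving (even-indexed) attributes
print(selected)                    # [2. 4. 6. 8.]
```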

The feature vectors are extracted from the input attributes as follows:

$$ X \rightarrow E = \{ e_{1}, e_{2}, e_{3}, \ldots, e_{s} \} $$
(2)

Then, the eigenvector is input into the classifier to obtain the classification result of the ith tree marked as \( y_{i} \).

The final classification result of the classifier is

$$ Y = \mathrm{vote}\{ y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{j} \} $$
(3)
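The vote in Eq. (3) is a plain majority vote over the per-tree labels; a one-function Python sketch (label names illustrative):

```python
from collections import Counter

def vote(tree_outputs):
    """Return the class label receiving the most votes among y_1..y_j."""
    return Counter(tree_outputs).most_common(1)[0][0]

print(vote(["c1", "c2", "c1", "c3", "c1"]))   # c1
```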

Figure 1 shows the classification principle.

Fig. 1. Principle of the RFT model

4 Experiment

This section illustrates the classification performance of the RFT model via multiple rounds of experiments on fan tooth-profile fault data and comprehensively analyzes the model's performance via comparative experiments.

4.1 Description of Source Data

This study uses fan tooth-profile fault data as the research object. As a type of SCADA monitoring data, these data usually comprise hundreds of variables; the screening used in this study retained 28 continuous numerical variables, which cover fan working parameters, environmental parameters, state parameters, and other dimensions. Table 1 presents the names and descriptions of the variables.

Table 1. Description of source data.

Prior to classification, the relationship between the attributes and the group label is preliminarily analyzed. Taking pitch 3_ng5_tmp, Int_tmp, Environment_tmp, and time as examples, Figs. 2, 3, 4 and 5 present the effects of these attributes on the group. The attributes in Table 1 have a certain functional relationship with the classification identifiers and can therefore be used as classification attributes.

Fig. 2. Relationship between pitch 3_ng5_tmp and group

Fig. 3. Relationship between Int_tmp and group

Fig. 4. Relationship between Environment_tmp and group

Fig. 5. Relationship between time and group

4.2 Evaluation Indicators

This paper reports the kappa coefficient and accuracy to observe classification performance. The kappa coefficient is a statistic that measures inter-rater agreement for qualitative items; it is generally considered more robust than a simple percent-agreement calculation because it accounts for agreement occurring by chance [7]. Accuracy is the ratio of the number of correctly predicted samples to the total number of predicted samples. In binary classification, for example, TP indicates that both the forecast and the actual result are positive, FP that the forecast is positive but the actual result is negative, TN that both are negative, and FN that the forecast is negative but the actual result is positive. Accuracy is then calculated as follows:

$$ Acc. = (TP + TN)/(TP + TN + FP + FN) $$
(4)

Although this research deals with a non-binary classification problem, Formula (4) describes the calculation method. The difference is that TP, FP, TN, and FN are weighted values here.
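Both indicators can be computed directly from the true and predicted label vectors. The sketch below is a generic, library-free illustration (not the paper's code): accuracy follows Eq. (4), and kappa corrects the observed agreement by the agreement expected by chance.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label (Eq. 4)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement p_o corrected for the chance
    agreement p_e implied by the two marginal label distributions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    po = np.mean(y_true == y_pred)                          # observed agreement
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return float((po - pe) / (1 - pe))
```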

4.3 Experimental Verification

To verify the effectiveness of the proposed algorithm, it is compared with random forest [8], random tree, and naive Bayes; the parameters in the experiment are determined via ten-fold cross-validation. In the following, k denotes the kappa coefficient, NB naive Bayes, RF random forest, and RT random tree.
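Ten-fold cross-validation amounts to shuffling the sample indices, splitting them into ten folds, and letting each fold serve once as the test set. A minimal index-splitting sketch (illustrative, not the experiment's actual tooling):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds;
    yield (train, test) index pairs, one per fold."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        test = fold
        train = np.setdiff1d(idx, fold)   # everything not in the test fold
        yield train, test
```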

First Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,833. Five algorithms are used to observe classification performance. Tables 2 and 3 reflect their performances.

Table 2. Classifier performance when number of instances = 10833
Table 3. Performance comparison of RFT with other models when number of instances = 10833

Second Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,552. Five algorithms are used to observe classification performance. Tables 4 and 5 display their performances.

Table 4. Classifier performance when number of instances = 10552
Table 5. Performance comparison of RFT with other models when number of instances = 10552

Third Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,403. Five algorithms are used to observe classification performance. Tables 6 and 7 present their performances.

Table 6. Classifier performance when number of instances = 10403
Table 7. Performance comparison of RFT with other models when number of instances = 10403

Tables 3, 5 and 7 compare the performance of RFT with the other models. Letting the subscript other_model denote any comparison model, the values in these tables are computed as follows.

$$ Acc\_cmp = (Acc_{RFT} - Acc_{other\_model}) / Acc_{other\_model} $$
(5)
$$ k\_cmp = (k_{RFT} - k_{other\_model}) / k_{other\_model} $$
(6)
$$ time\_cmp = (time_{RFT} - time_{other\_model}) / time_{other\_model} $$
(7)
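Eqs. (5)-(7) share a single form: the signed relative difference between an RFT metric and the same metric for a comparison model, so a positive value means RFT scores higher. A one-line helper (illustrative values):

```python
def relative_change(rft_value, other_value):
    """Common form of Eqs. (5)-(7): signed relative difference between the
    RFT metric and the corresponding comparison-model metric."""
    return (rft_value - other_value) / other_value

print(round(relative_change(0.95, 0.90), 4))   # 0.0556, i.e. RFT ~5.56% higher
```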

4.4 Analysis of Experimental Results

As Table 3 shows, the accuracy and kappa coefficient of the RFT model are not lower than those of the other models; its accuracy is 2.44% higher than that of random tree and 9.79% higher than that of naive Bayes. In addition, the time loss of the RFT model is considerably lower than that of the other benchmark models, except naive Bayes; given the low accuracy of naive Bayes, the RFT model offers the best balance of accuracy and time loss. In summary, the RFT model is suitable for industrial data classification in terms of both time efficiency and classification accuracy.

5 Conclusions

This research proposes a random tree-based random forest (RFT) model for classifying SCADA real-time monitoring data. The model uses the idea of random tree to extract features from attributes after sampling, substantially decreasing time loss compared with random forest while achieving higher classification accuracy than single models such as naive Bayes. Experiments show that the RFT model can simultaneously meet the accuracy and time-efficiency requirements of industrial data classification.