
1 Introduction

With the advent of the new industrial revolution, industrial control systems have gradually moved from closed to open environments, and their security situation has become increasingly severe. Protecting industrial data is an urgent problem in the development of the industrial Internet. Data classification is the precondition and core foundation of industrial Internet data security because it informs the strategies used in data-protection decision-making. However, compared with traditional data, the volume of industrial Internet data is extremely large, which makes classification difficult. Furthermore, the data mostly comprise real-time monitoring data and therefore impose stricter requirements on time efficiency. Relevant industry enterprises are accelerating the formulation of classification standards and classification technology, and the automatic classification of industrial Internet data has become an emerging trend.

Applying machine-learning classification algorithms to industrial Internet data is an experimental approach to the automatic classification of industrial data. However, industrial data require higher time efficiency than traditional data, so a model that considers both accuracy and time is necessary to ensure adequate performance. Current single classifiers do not achieve both. To solve this problem, this research proposes the RFT classification model and evaluates it on fault data of fan tooth belts. The research contributions are as follows.

  • An automatic classification method for industrial data is proposed on the basis of random tree. Compared with the general random forest model, attribute sampling is applied to improve classification time efficiency. Experiments show that the proposed model can save more time than the random forest.

  • The model proposed in this paper is based on ensemble learning. Experiments show that this model has higher classification accuracy than single models such as naive Bayes and random tree.

The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 establishes the classification model proposed in this research. Section 4 conducts an experimental analysis of the proposed model. Section 5 summarizes the work of this research.

2 Related Works

Many studies at home and abroad focus on data protection and data classification, but few specialize in industrial data classification. To address this problem, Xu et al. [1] proposed a classification method for remotely sensed data sources using a two-branch convolutional neural network (CNN), and Dörksen et al. [2] presented ComRef-2D-ConvHull, a method based on ComRef for linear classification optimization in lower-dimensional feature space. However, these algorithms focus on accuracy, and their time efficiency is poor. Platos et al. [3] described the processing of two different datasets acquired from a steel-mill factory using three methods, namely, SVM, fuzzy rules, and Bayesian classification; however, their accuracy is insufficient. Among other typical classification models, Chutia et al. [4] developed an effective method based on principal component analysis (PCA) to improve the classification accuracy of random forest in both predictive ability and computational expense. However, PCA assumes that the variables follow a Gaussian distribution; when they do not (e.g., a uniform distribution), scaling and rotation distortions occur, so the method is unsuitable for time-series data of the industrial Internet. Tong et al. [5] proposed a novel privacy-preserving naive Bayes learning scheme with multiple data sources, which enables a trainer to train a naive Bayes classifier over a dataset provided jointly by different data owners without the help of a trusted curator; however, its time performance is poor. Augereau et al. [6] proposed a document image classification method combining textual features extracted with the bag-of-words technique and visual features extracted with the bag-of-visual-words technique; however, data generated by Internet of Things devices are often structured and numerical, so the method is unsuitable for non-text industrial data.

In summary, current classification methods mostly focus on text and image data, and little research specializes in industrial data classification. Furthermore, current methods focus heavily on accuracy and cannot meet the time-efficiency requirements of industrial data. Moreover, classification accuracy itself still requires improvement.

3 Classification Model

A single classifier has limited classification accuracy, whereas a combination of the naive Bayes algorithm and a decision tree consumes considerable time. Decreasing the time consumption of the combined model is the core idea of the model built in this research. On this basis, a random forest model based on random tree (RFT) is proposed. Random forest is an ensemble classifier that integrates many decision trees into a forest and uses them jointly to predict the final result; the ensemble compensates for the inherent shortcomings of any single model and avoids its limitations. Specifically, it first generates j training sets using the bootstrap method, which resamples many new sets of the same size from the original samples. A decision tree is constructed for each training set. When a node searches for a feature to split on, some features are randomly extracted from the feature set, and the optimal split is found among the extracted features. Following the bagging idea, random forest thus samples both instances and features, which helps avoid over-fitting; as a combination-based algorithm, its accuracy is usually higher than that of other machine-learning algorithms. Random tree randomly selects several attributes to construct a tree, so its classification time is relatively low. This research applies the random-tree concept to random forest to improve its efficiency: first, several training samples are generated; then, some attributes are randomly selected; finally, features are extracted from those attributes to make decisions. In other words, samples, attributes, and features are all sampled to construct the classifiers in our model. The principle of the model is as follows:
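The three-level sampling described above (instances, attributes, then features, followed by a majority vote) can be sketched as a minimal, self-contained Python example. This is not the authors' implementation: a nearest-centroid learner stands in for a real random tree, and all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rft_fit(X, y, n_trees=10, attr_frac=0.5, feat_frac=0.5):
    """Toy RFT-style ensemble: each member sees a bootstrap sample of the
    rows, a random subset of the attributes, and a further random subset of
    features drawn from those attributes; it then stores class centroids."""
    n, d = X.shape
    members = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                        # bootstrap sample
        attrs = rng.choice(d, size=max(1, int(d * attr_frac)), replace=False)
        feats = rng.choice(attrs, size=max(1, int(len(attrs) * feat_frac)),
                           replace=False)
        Xi, yi = X[rows][:, feats], y[rows]
        centroids = {c: Xi[yi == c].mean(axis=0) for c in np.unique(yi)}
        members.append((feats, centroids))
    return members

def rft_predict(members, x):
    """Each member votes with its nearest class centroid; the ensemble
    answer is the majority vote over all member votes."""
    votes = []
    for feats, centroids in members:
        xi = x[feats]
        votes.append(min(centroids, key=lambda c: np.linalg.norm(xi - centroids[c])))
    return max(set(votes), key=votes.count)
```

Because every member trains on a different row/attribute/feature slice, the members disagree in different ways, and the vote smooths out individual errors while each member stays cheap to build.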

Assume that the training sample has \( l \) records; then, the total sample can be marked as \( G = \{ g_{1}, g_{2}, \ldots, g_{l} \} \). After sampling, \( j \) trees are generated, where \( j \in \{ 1,2,3,\ldots,d\} \) and \( d > 0 \). The sample attribute set \( X \subseteq R^{n} \) is a set of n-dimensional vectors, the attribute probability set \( P = \{ 0,1\} \) is a set of two-dimensional vectors, and \( E \subseteq R^{s} \) is a set of s-dimensional eigenvectors. The output \( Y = \{ c_{1}, c_{2}, c_{3}, \ldots, c_{k} \} \) is a set of class tags. The input attribute vector is \( x \in X \), the attribute probability vector is \( p \in P \), the feature vector is \( e \in E \), and the output vector is \( y \in Y \).

Then, the RFT is constructed as follows:

For the \( i \)th (\( i \le j \)) subsample in the random forest, \( T = \{ g_{1}, g_{3}, \ldots, g_{2h - 1} \} \), \( h < \frac{l + 1}{2} \), assume that attribute selection uses even multiple interpolation sampling. Then, the set of input attributes is

$$ X = \{ p_{0} \cdot x_{1},\, p_{1} \cdot x_{2},\, p_{0} \cdot x_{3}, \ldots, p_{1} \cdot x_{n} \}, \quad p_{0} = 0,\; p_{1} = 1,\; n = 2m,\; m > 0 $$
(1)
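Concretely, Eq. (1) zeroes out every attribute paired with \( p_{0} = 0 \), so only the even-indexed attributes survive the sampling. A small numpy sketch (with an illustrative eight-attribute vector) makes this explicit:

```python
import numpy as np

x = np.arange(1.0, 9.0)            # attribute vector x_1..x_8, so n = 2m = 8
p = np.tile([0, 1], len(x) // 2)   # alternating probabilities: p_0 = 0, p_1 = 1
masked = p * x                     # as in Eq. (1): p_0-attributes are zeroed out
selected = x[p == 1]               # the surviving (even-indexed) attributes
print(selected)                    # [2. 4. 6. 8.]
```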

The feature vectors are extracted from the input attributes as follows:

$$ X \rightarrow E = \{ e_{1}, e_{2}, e_{3}, \ldots, e_{s} \} $$
(2)

Then, the eigenvector is input into the classifier to obtain the classification result of the ith tree marked as \( y_{i} \).

The final classification result of the classifier is

$$ Y = \mathrm{vote}\{ y_{1}, y_{2}, \ldots, y_{i}, \ldots, y_{j} \} $$
(3)
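The vote in Eq. (3) is a plain majority vote over the per-tree labels; a one-function Python sketch (label names illustrative):

```python
from collections import Counter

def vote(tree_outputs):
    """Return the class label receiving the most votes among y_1..y_j."""
    return Counter(tree_outputs).most_common(1)[0][0]

print(vote(["c1", "c2", "c1", "c3", "c1"]))   # c1
```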

Figure 1 shows the classification principle.

Fig. 1. Principle of the RFT model

4 Experiment

This section illustrates the classification performance of the RFT model via multiple rounds of experiments on fan tooth-profile fault data and comprehensively analyzes the model's performance via comparative experiments.

4.1 Description of Source Data

This study uses fan tooth-profile fault data as the research object. As a type of SCADA monitoring data, these data usually comprise hundreds of variables; the screening used in this study retained 28 continuous numerical variables, which cover fan working parameters, environmental parameters, state parameters, and other dimensions. Table 1 presents the names and descriptions of the variables.

Table 1. Description of source data.

Prior to classification, the relationship between the attributes and the group label is preliminarily analyzed. Taking pitch 3_ng5_tmp, Int_tmp, Environment_tmp, and time as examples, Figs. 2, 3, 4 and 5 present the effects of these attributes on the group. The attributes in Table 1 have a certain functional relationship with the classification identifiers and can therefore be used as classification attributes.

Fig. 2. Relationship between pitch 3_ng5_tmp and group

Fig. 3. Relationship between Int_tmp and group

Fig. 4. Relationship between Environment_tmp and group

Fig. 5. Relationship between time and group

4.2 Evaluation Indicators

This paper reports the kappa coefficient and accuracy to observe classification performance. The kappa coefficient is a statistic that measures inter-rater agreement for qualitative items; it is generally considered more robust than a simple percent-agreement calculation because it accounts for agreement occurring by chance [7]. Accuracy is the ratio of the number of correctly predicted samples to the total number of predicted samples. In binary classification, for example, TP indicates that both the forecast and the actual result are positive, FP that the forecast is positive but the actual result is negative, TN that both are negative, and FN that the forecast is negative but the actual result is positive. Accuracy is then calculated as follows:

$$ Acc. = (TP + TN)/(TP + TN + FP + FN) $$
(4)

Although this research deals with a non-binary classification problem, Formula (4) describes the calculation method. The difference is that TP, FP, TN, and FN are weighted values here.
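Both indicators can be computed directly from the true and predicted label vectors. The sketch below is a generic, library-free illustration (not the paper's code): accuracy follows Eq. (4), and kappa corrects the observed agreement by the agreement expected by chance.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label (Eq. 4)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement p_o corrected for the chance
    agreement p_e implied by the two marginal label distributions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    po = np.mean(y_true == y_pred)                          # observed agreement
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return float((po - pe) / (1 - pe))
```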

4.3 Experimental Verification

To verify the effectiveness of the proposed algorithm, it is compared with random forest [8], random tree, and naive Bayes; the parameters in the experiment are determined via ten-fold cross-validation. In the following, k denotes the kappa coefficient, NB naive Bayes, RF random forest, and RT random tree.
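Ten-fold cross-validation amounts to shuffling the sample indices, splitting them into ten folds, and letting each fold serve once as the test set. A minimal index-splitting sketch (illustrative, not the experiment's actual tooling):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds;
    yield (train, test) index pairs, one per fold."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        test = fold
        train = np.setdiff1d(idx, fold)   # everything not in the test fold
        yield train, test
```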

First Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,833. Five algorithms are used to observe classification performance. Tables 2 and 3 reflect their performances.

Table 2. Classifier performance when number of instances = 10833
Table 3. Performance comparison of RFT with other models when number of instances = 10833

Second Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,552. Five algorithms are used to observe classification performance. Tables 4 and 5 display their performances.

Table 4. Classifier performance when number of instances = 10552
Table 5. Performance comparison of RFT with other models when number of instances = 10552

Third Group of Experiments.

The number of training samples and test samples selected in this group of experiments is 10,403. Five algorithms are used to observe classification performance. Tables 6 and 7 present their performances.

Table 6. Classifier performance when number of instances = 10403
Table 7. Performance comparison of RFT with other models when number of instances = 10403

Tables 3, 5 and 7 compare the performance of RFT with the other models. Letting the subscript other_model denote any comparison model, the values in these tables are computed as follows.

$$ Acc\_cmp = (Acc_{RFT} - Acc_{other\_model}) / Acc_{other\_model} $$
(5)
$$ k\_cmp = (k_{RFT} - k_{other\_model}) / k_{other\_model} $$
(6)
$$ time\_cmp = (time_{RFT} - time_{other\_model}) / time_{other\_model} $$
(7)
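Eqs. (5)-(7) share a single form: the signed relative difference between an RFT metric and the same metric for a comparison model, so a positive value means RFT scores higher. A one-line helper (illustrative values):

```python
def relative_change(rft_value, other_value):
    """Common form of Eqs. (5)-(7): signed relative difference between the
    RFT metric and the corresponding comparison-model metric."""
    return (rft_value - other_value) / other_value

print(round(relative_change(0.95, 0.90), 4))   # 0.0556, i.e. RFT ~5.56% higher
```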

4.4 Analysis of Experimental Results

As Table 3 shows, the accuracy and kappa coefficient of the RFT model are not lower than those of the other models; its accuracy is 2.44% higher than that of random tree and 9.79% higher than that of naive Bayes. In addition, the time loss of the RFT model is considerably lower than that of the other benchmark models, except naive Bayes; given the low accuracy of naive Bayes, the RFT model offers the best balance of accuracy and time loss. In summary, the RFT model is suitable for industrial data classification in terms of both time efficiency and classification accuracy.

5 Conclusions

This research proposes a random tree-based random forest (RFT) model for classifying SCADA real-time monitoring data. The model uses the idea of random tree to extract features from attributes after sampling, substantially decreasing time loss compared with random forest while achieving higher classification accuracy than single models such as naive Bayes. Experiments show that the RFT model can simultaneously meet the accuracy and time-efficiency requirements of industrial data classification.