Abstract
Large amounts of data are generated daily in industrial settings, and their safety bears on national security and social interests. Data classification is an important component of data protection because different types of data may require different protection methods. Compared with traditional data, industrial data mostly comprise real-time monitoring data and therefore impose strict requirements on time efficiency. However, existing classification methods, which are optimized for accuracy, cannot meet these time-efficiency requirements. To solve this problem, a random tree-based random forest model is proposed. This model is an ensemble classifier with attribute sampling, combining the accuracy of random forest with the speed of random tree. Experiments show that the proposed model improves on the accuracy of existing single models, reduces the time cost of random forest, and is suitable for data classification in industrial environments.
1 Introduction
With the advent of the new industrial revolution, industrial control systems have gradually moved from closed to open environments, and their security situation has become increasingly severe. Protecting industrial data security is an urgent problem in the development of the industrial Internet. Data classification is the precondition and core foundation of industrial Internet data security, because it informs the protection strategies applied to each class of data. However, compared with traditional data, the volume of industrial Internet data is extremely large, making classification difficult. Furthermore, the data mostly comprise real-time monitoring data and thus have higher requirements for time efficiency. Industry enterprises are accelerating the formulation of classification standards and technologies, and the automatic classification of industrial Internet data has become an emerging trend.
Applying machine-learning classification algorithms to industrial Internet data is a practical approach to automatic classification. However, because industrial data require higher time efficiency than traditional data, a model that balances accuracy and time is necessary to ensure adequate performance on industrial Internet data. Current single classifiers do not achieve both. To solve this problem, this research proposes the RFT classification model and evaluates it on fault data of fan tooth belts. The research contributions are as follows.
-
An automatic classification method for industrial data is proposed on the basis of random tree. Compared with the general random forest model, attribute sampling is applied to improve classification time efficiency. Experiments show that the proposed model saves more time than random forest.
-
The proposed model is based on ensemble learning. Experiments show that it achieves higher classification accuracy than single models such as naive Bayes and random tree.
The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 establishes the classification model proposed in this research. Section 4 conducts an experimental analysis of the proposed model. Section 5 summarizes the work of this research.
2 Related Works
Many studies focus on data protection and data classification, but few specialize in industrial data classification. Xu et al. [1] proposed a classification method for remotely sensed data sources using a two-branch convolutional neural network (CNN); Dörksen et al. [2] presented ComRef-2D-ConvHull, a ComRef-based method for linear classification optimization in a lower-dimensional feature space. However, these algorithms focus on accuracy, and their time efficiency is poor. Platos et al. [3] processed two datasets acquired from a steel-mill factory using three methods, namely, SVM, fuzzy rules, and Bayesian classification; however, the accuracy is insufficient. Among other typical classification models, Chutia et al. [4] developed a method based on principal component analysis to improve the classification accuracy of random forest in both predictive ability and computational expense. However, principal component analysis assumes that the variables follow a Gaussian distribution; when they do not (e.g., a uniform distribution), scaling and rotation distort the result, so the method is unsuitable for time-series data of the industrial Internet. Tong et al. [5] proposed a privacy-preserving naive Bayes learning scheme with multiple data sources, which enables a trainer to train a naive Bayes classifier over a dataset provided jointly by different data owners without a trusted curator; however, its time performance is poor. Augereau et al. [6] proposed a document image classification method combining textual features extracted with the bag-of-words technique and visual features extracted with the bag-of-visual-words technique. However, data generated by Internet of Things devices are often structured and numerical, so the method is unsuitable for non-text industrial data.
In summary, current classification methods mostly target text and image data, and little research specializes in industrial data classification. Moreover, current methods focus heavily on accuracy and cannot meet the time-efficiency requirements of industrial data, and classification accuracy itself still requires improvement.
3 Classification Model
A single classifier has limited classification accuracy, whereas combination models such as naive Bayes with decision trees consume considerable time. Reducing the time consumption of the combined model is the core idea of the model built in this research. On this basis, a random forest model based on random tree is proposed. Random forest is an ensemble classifier that integrates many decision trees into a forest and uses them jointly to predict the final result; the ensemble compensates for the inherent shortcomings of any single model. Specifically, it first generates j training sets using bootstrap sampling, which draws new samples of the same size, with replacement, from the original sample. A decision tree is constructed for each training set; when a node searches for a feature to split on, some features are randomly extracted from the feature set, and the optimal split is chosen among them. Following the bagging idea, random forest thus samples both records and features, which helps avoid over-fitting, and as a combination-based algorithm its accuracy is usually higher than that of other machine-learning algorithms. Random tree constructs a tree from a randomly selected subset of attributes, so its classification time is low. This research applies the random tree idea to random forest to improve its efficiency: we first generate several training samples, then randomly select a subset of attributes, and finally extract features to make split decisions. In other words, records, attributes, and features are all sampled when constructing the classifiers in our model. The principle of the model is as follows:
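As a minimal sketch of this triple-sampling idea (our own illustration; the paper does not provide an implementation), scikit-learn's `BaggingClassifier` can bootstrap the records and subsample the attributes per tree, while the base tree's `max_features` subsamples the features considered at each split. All names and parameter values below are assumptions for illustration:

```python
# Sketch of the RFT idea: sample records (bootstrap), attributes (per tree),
# and features (per split). Parameters are illustrative, not the authors'.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def make_rft(n_trees=100, attr_frac=0.5, random_state=0):
    # Each base tree considers a random subset of features at every split,
    # mirroring the "random tree" component.
    base = DecisionTreeClassifier(max_features="sqrt", random_state=random_state)
    return BaggingClassifier(
        base,
        n_estimators=n_trees,      # j trees in the forest
        bootstrap=True,            # bootstrap-sample the records
        max_features=attr_frac,    # sample a fraction of attributes per tree
        bootstrap_features=False,  # attributes drawn without replacement
        random_state=random_state,
    )
```

Restricting each tree to a fraction of the attributes is what trades a small amount of per-tree accuracy for a shorter training and prediction time, which the ensemble vote then recovers.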
Assume the training sample has l records; the total sample is denoted \( G = \{ g_{1} ,g_{2} , \ldots g_{l} \} \). After sampling, j trees are generated, with \( j \in \{ 1,2,3 \ldots d\} \), \( d > 0 \). The sample attribute space \( X\, \subseteq\, R^{n} \) is a set of n-dimensional vectors, the attribute probability \( P = \{ 0,1\} \) is a set of two-dimensional vectors, and \( E\, \subseteq\, R^{s} \) is a set of s-dimensional eigenvectors. The output \( Y = \{ c_{1} ,c_{2} ,c_{3} \ldots c_{k} \} \) is the set of class tags. The input attribute vector is \( x \in X \), the attribute probability vector is \( p \in P \), the feature vector is \( e \in E \), and the output vector is \( y \in Y \).
Then, the RFT is constructed as follows:
For the ith (\( i \le j \)) subsample in the random forest, \( T = \{ g_{1} ,g_{3} , \ldots g_{2h - 1} \} \), \( h\,<\, \frac{l + 1}{2} \), assume the attributes are selected by even-multiple interpolation sampling. Then, the set of input attributes is
The feature vectors are extracted from the input attributes as follows:
Then, the eigenvector is input into the classifier to obtain the classification result of the ith tree marked as \( y_{i} \).
The final classification result of the classifier is
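The aggregation step can be sketched as a simple majority vote over the j per-tree labels (a plain-Python illustration of the idea, not the authors' code):

```python
from collections import Counter

def majority_vote(per_tree_labels):
    """Return the class predicted by the most trees.

    per_tree_labels: iterable of the j per-tree predictions y_1, ..., y_j.
    Ties are broken by first occurrence, an assumption the paper does
    not specify.
    """
    return Counter(per_tree_labels).most_common(1)[0][0]
```

For example, `majority_vote(["fault", "normal", "fault"])` returns `"fault"`.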
Figure 1 shows the classification principle.
4 Experiment
This section illustrates the classification performance of the RFT model via multi-round experiments on fault data of fan tooth profiles, and we comprehensively analyze the performance of the model via comparative experiments.
4.1 Description of Source Data
This study uses fan tooth profile fault data as the research object; as a type of SCADA monitoring data, these data usually comprise hundreds of variables. After screening, 28 continuous numerical variables were retained, covering fan working parameters, environmental parameters, state parameters, and other dimensions. Table 1 presents the name and description of each variable.
Prior to classification, the relationship between the attributes and the group label is preliminarily analyzed. Taking pitch 3_ng5_tmp, Int_tmp, Environment_tmp, and time as examples, Figs. 2, 3, 4 and 5 present the effects of these attributes on the group. The attributes in Table 1 have a certain functional relationship with the classification identifiers and can therefore be used as classification attributes.
4.2 Evaluation Indicators
This paper reports the kappa coefficient and accuracy to measure performance. The kappa coefficient is a statistic that measures inter-rater agreement for qualitative items; it is generally considered more robust than a simple percent-agreement calculation because it accounts for agreement occurring by chance [7]. Accuracy is the ratio of correctly predicted samples to the total number of predicted samples. In binary classification, for example, let TP denote a positive forecast with a positive actual result, FP a positive forecast with a negative actual result, TN a negative forecast with a negative actual result, and FN a negative forecast with a positive actual result. Accuracy can then be calculated as

\( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \)
Although this research deals with a non-binary classification problem, Formula (4) still describes the calculation; the difference is that TP, FP, TN, and FN are class-weighted values here.
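Both indicators are available off the shelf; a short check with scikit-learn (illustrative labels, not the paper's data):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)       # (TP + TN) / total = 0.75
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement = 0.5
```

Here observed agreement is 0.75 and expected chance agreement is 0.5, so kappa = (0.75 − 0.5)/(1 − 0.5) = 0.5, illustrating why kappa is stricter than raw accuracy.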
4.3 Experimental Verification
To verify the effectiveness of the proposed algorithm, it is compared with random forest [8], random tree, and naive Bayes; the parameters in the experiments are determined via ten-fold cross-validation. In what follows, k denotes the kappa coefficient, NB naive Bayes, RF random forest, and RT random tree.
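The ten-fold protocol can be reproduced with `cross_val_score`. A sketch under stated assumptions: a synthetic dataset stands in for the SCADA data, `GaussianNB` for the paper's NB, and a feature-subsampled `DecisionTreeClassifier` for its random tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SCADA dataset (28 numeric attributes).
X, y = make_classification(n_samples=1000, n_features=28, n_classes=3,
                           n_informative=10, random_state=0)

models = {
    "NB": GaussianNB(),
    "RT": DecisionTreeClassifier(max_features="sqrt", random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # ten-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Each model is scored on the same ten folds, so the mean accuracies are directly comparable, mirroring how the tables below compare the algorithms.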
First Group of Experiments.
The number of training samples and test samples selected in this group of experiments is 10,833. Five algorithms are used to observe classification performance. Tables 2 and 3 reflect their performances.
Second Group of Experiments.
The number of training samples and test samples selected in this group of experiments is 10,552. Five algorithms are used to observe classification performance. Tables 4 and 5 display their performances.
Third Group of Experiments.
The number of training samples and test samples selected in this group of experiments is 10,403. Five algorithms are used to observe classification performance. Tables 6 and 7 present their performances.
Tables 3, 5 and 7 compare the performance of RFT with the other models. Denoting any comparison model as the "other model", the values in these tables are computed as follows.
4.4 Analysis of Experimental Results
As Table 3 shows, the accuracy and kappa coefficient of the RFT model are not lower than those of the other models: its accuracy is 2.44% higher than that of the random tree model and 9.79% higher than that of the naive Bayes model. In addition, the time cost of the RFT model is considerably lower than that of the other benchmark models except naive Bayes; given the low accuracy of naive Bayes, the RFT model offers the best balance of accuracy and time cost. In summary, considering both time efficiency and classification accuracy, the RFT model is suitable for industrial data classification.
5 Conclusions
This research proposes a random tree-based random forest (RFT) model for classifying SCADA real-time monitoring data. The model uses the idea of random tree to extract features from sampled attributes, substantially decreasing the time cost compared with random forest while achieving higher classification accuracy than single models such as naive Bayes. Experiments show that the RFT model simultaneously meets the accuracy and time-efficiency requirements of industrial data classification.
References
Xu, X., Wei, L., Ran, Q., et al.: Multisource remote sensing data classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. PP(99), 1–13 (2018)
Dörksen, H., Mönks, U., Lohweg, V.: Fast classification in industrial big data environments. In: 19th International Conference on Emerging Technologies & Factory Automation (ETFA 2014). IEEE (2014)
Platos, J., Kromer, P.: Prediction of multi-class industrial data. In: International Conference on Intelligent Networking & Collaborative Systems (2013)
Chutia, D., Borah, N., Baruah, D., et al.: An effective approach for improving the accuracy of a random forest classifier in the classification of Hyperion data. Appl. Geomat. (2) (2019)
Tong, L., Jin, L., Liu, Z., et al.: Differentially private Naive Bayes learning over multiple data sources. Inf. Sci. 444, 89–104 (2018)
Augereau, O., Journet, N., Vialard, A., et al.: Improving classification of an industrial document image database by combining visual and textual features. In: IAPR International Workshop on Document Analysis Systems. IEEE (2014)
Liu, C., Wang, W., Zhang, Y., et al.: Predicting the popularity of online news based on multivariate analysis. In: 2017 IEEE International Conference on Computer and Information Technology (CIT). IEEE Computer Society (2017)
Svetnik, V.: Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43(6), 1947 (2003)
© 2019 Springer Nature Singapore Pte Ltd.
Liu, C., Chen, X., Sun, Y., Yang, S., Li, J. (2019). RFT: An Industrial Data Classification Method Based on Random Forest. In: Ning, H. (eds) Cyberspace Data and Intelligence, and Cyber-Living, Syndrome, and Health. CyberDI CyberLife 2019 2019. Communications in Computer and Information Science, vol 1137. Springer, Singapore. https://doi.org/10.1007/978-981-15-1922-2_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1921-5
Online ISBN: 978-981-15-1922-2