1 Introduction

Recent changes and important contributions in the fields of artificial intelligence, robotics and automation technologies are having a strong impact on the knowledge management processes that form the basis for decision making in industry, health, home, finance, business, social networks and many other areas, with the aim of creating smart environments [8, 16, 24, 35, 38, 48]. In addition, many terms have been coined (Smart Manufacturing, Smart Production, Industrial Internet, i4.0, Connected Industry 4.0) to identify all that is encompassed by the paradigm of the fourth industrial revolution. These terms are representative of the changes to the industrial model known so far, brought about by the irruption of the Internet of Things (IoT), Wireless Sensor Networks (WSN), cloud computing and Cyber-Physical Systems (CPS) [17, 23, 33].

These recent challenges in industry require technologies that have previously been used to build dynamic, intelligent, flexible and open applications capable of working in a real-time environment [13, 14, 27].

These emerging technologies have renewed the interest of researchers, universities, companies and governments in applying predictive analysis to the industrial environment. In this sub-discipline of data analysis, they have found techniques and tools for developing models capable of predicting future events, failures or behavior [34, 43]. Prediction models are created by leveraging statistical techniques, machine learning (ML) or data mining to extract behavioral patterns from a dataset and to identify risks and opportunities in those patterns [19, 32].

Several intelligent systems have been developed using sensors [46, 47, 52]. The incorporation of sensors in infrastructure, ambient intelligence, products, manufacturing equipment or production monitoring is an implicit component of the Industry 4.0 paradigm. Some authors insist that sensor networks still require development: although there are extensive proposals in this field, their use still presents many challenges in automatic data fusion, processing and the integration of the large volumes of data they generate [44, 45, 53].

To obtain information that is useful in decision-making processes, such as predictions that anticipate failures, demand, production or sales volume, an overwhelming amount of data needs to be processed, and this remains a challenge [5]. This paper is organized as follows: the first section provides a brief background on Industry 4.0, predictive maintenance, the application of Machine Learning (ML) in the context of Industry 4.0 and the prediction of failure. Then the case study is described. Finally, results showing the accuracy of the applied Logistic Regression and Random Forest models are outlined, and the conclusions and future work are presented.

2 Industry 4.0

The concept of Industry 4.0 was born in Germany in 2011, when the government and the business sector, led by Bosch, formed a research group to find a common framework that would allow for the application of new technologies. They delivered their first report in 2012, and it was presented in public at the Hannover Fair in 2013. This was the beginning of the paradigm now known as the fourth industrial revolution; different countries refer to it by different terms according to the initiatives developed in each of them. It is applied within the industrial ecosystem both at the macro level and at SMEs [33, 51].

In addition to enablers such as IoT, CPS, Big Data, cybersecurity and 3D printing, there is a large number of requirements for the implementation of Industry 4.0. This research addresses predictive maintenance from the perspective of the data generated by sensors, in this case temperature sensors on HVAC systems installed in 20 buildings. The aim is to detect failures in HVAC equipment that reaches an extreme temperature, using the data collected by the sensors.

2.1 Predictive Maintenance

The concept of predictive maintenance is not new. However, as an axis of development for the adoption of the Industry 4.0 scheme, predictive maintenance is the subject of research. The aim is to obtain models that reduce uncertainty in diagnoses. Ballesteros [2] lists the basic conditions that must be satisfied for an organization to be considered to have a predictive maintenance scheme:

  • When the operation of a piece of equipment is monitored and measured, it must be done in a non-intrusive way, under normal operating conditions.

  • The variable that has to be measured in order to make the predictions must fulfill the conditions of repeatability, analysis, parametrization and diagnosis.

  • The results and the values of the measures can be expressed in physical units or correlated indexes.

There is a tendency to expand research on the application of predictive models in industrial environments. One line of work presents the development of a predictive maintenance system for power equipment, while other authors propose its application to wind turbines or to the prediction of anomalies in triaxial machines [6, 25]. These works share common elements: they are based on machine learning (ML) methods and pursue the development of ML algorithms that increase the accuracy of their predictions [34, 42].

2.2 Machine Learning for Prediction

There are important contributions in the field of artificial intelligence and its techniques, such as Case-Based Reasoning (CBR) [3, 9, 10, 13, 18] and machine learning algorithms, which aim to make predictions, improve results and generalize better from the dataset. It is therefore necessary to construct models that facilitate prediction and analysis for decision making [12, 21, 22, 38].

In the last decades artificial intelligence and Machine Learning (ML) techniques have transcended into a great variety of areas such as neuroscience, social media [7, 38], scientific, health [28], industrial and economic activities, and a large number of scientific works have been published on this topic. This is indicative of its importance [1, 7, 29].

Thanks to Artificial Intelligence and Machine Learning (ML) algorithms, solutions can be developed for processing large volumes of data in the era of the Internet of Things and Big Data; in [36, 37], for example, the authors use Bayesian filters and other algorithms to process sensor signals and make predictions from them. On the basis of the extracted features and patterns, predictive models can be constructed using data analysis and ML algorithms [40, 41].

In a real environment, a dataset has to be obtained before ML techniques can be applied [11, 15, 49]. The data then go through different phases: pre-processing, training and application of a learning model, and finally evaluation. Based on the description in [54], a scheme of the ML stages has been designed; Fig. 1 shows them in the order in which they are performed. Data pre-processing is carried out to prepare the raw data. At this stage the data are unstructured, noisy, incomplete and inconsistent, and they are transformed so that they can be used as inputs for the algorithms selected for training. Subsequently, the training data are used to fit the model, and predictions are then obtained from a separate set of test data.

Fig. 1. Machine learning stages

To evaluate the model, error estimates and the results of statistical tests are analyzed. These analyses are used to adjust the parameters of the applied algorithms and to determine whether other algorithms are needed [32].

3 Industry 4.0 Environment Case Study

Heating, Ventilation and Air Conditioning (HVAC) systems control the indoor climate (air temperature, humidity and pressure), creating an optimal production environment in industrial buildings. This equipment is crucial for the operation of a factory in the context of Industry 4.0. However, routine maintenance does not always identify its failures.

The aim of predictive maintenance in Industry 4.0 is to extend equipment life by using different tools and techniques to identify abnormal patterns in variables such as vibration, temperature or balance. Given the importance of HVAC systems, a case study is presented. The following section describes a free dataset from temperature sensors installed on HVAC systems in 20 buildings [30].

3.1 Dataset Description

In this case study, a dataset that is organized by columns was used. It contains the optimal temperature record and the real values measured by sensors in buildings. It was used to analyze the behavior of an HVAC air conditioning system [30] and determine if the equipment is failing to keep the indoor temperatures in an optimal range.

This dataset contains a total of 8000 (eight thousand) temperature records (TargetTemp) captured by a sensor network installed in a set of buildings that were between 0 and 30 years old. The age of each building corresponds to the age of its HVAC system, which is identified by the independent variable ‘SystemAge’. Table 1 shows the structure of the dataset and its variables:

Table 1. Dataset

3.2 Dataset Pre-processing

Several authors have used algorithms to perform feature selection and data preprocessing [20, 39]. The system established a range for normal temperatures and two types of alarm that indicate extreme temperatures and therefore a possible failure. These are described in Table 2:

Table 2. Normal and extreme temperature
  • Normal: within 5° of the optimum temperature.

  • Cold: 5° colder than the optimum temperature. It is classified as an extreme temperature and a sign of possible failure.

  • Hot: 5° hotter than the optimum temperature. It is also classified as an extreme temperature and a sign of possible failure.

Two columns are added to the dataset, ‘Difference’ and ‘FilterDifference’. The first stores the difference between ‘TargetTemp’ and ‘Actualtemp’. In ‘FilterDifference’ a binary conversion is carried out, assigning 0 to normal temperatures and 1 to the extreme-temperature alarms.
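As an illustration only, the pre-processing step described above could be sketched in plain Python as follows. The function name, the sample rows and the handling of a deviation of exactly 5 degrees (treated here as normal) are assumptions made for this sketch and are not taken from the case study.

```python
# Threshold taken from Table 2: 5 degrees around the optimum temperature.
THRESHOLD = 5.0


def add_difference_columns(records, threshold=THRESHOLD):
    """Return copies of the records extended with 'Difference' and 'FilterDifference'."""
    extended = []
    for row in records:
        # 'Difference' stores TargetTemp minus Actualtemp.
        difference = row["TargetTemp"] - row["Actualtemp"]
        # Assumption: a deviation of exactly `threshold` still counts as normal (0).
        is_extreme = 1 if abs(difference) > threshold else 0
        extended.append({**row, "Difference": difference, "FilterDifference": is_extreme})
    return extended


# Hypothetical sample rows, invented for this sketch.
sample = [
    {"TargetTemp": 66, "Actualtemp": 58, "SystemAge": 20},  # 8 degrees colder than target
    {"TargetTemp": 69, "Actualtemp": 68, "SystemAge": 3},   # within the normal range
    {"TargetTemp": 70, "Actualtemp": 76, "SystemAge": 28},  # 6 degrees hotter than target
]
extended_sample = add_difference_columns(sample)
```

Both the cold and the hot alarm map to the same value 1, so the direction of the deviation is kept only in the ‘Difference’ column.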

4 Results

Once the data were pre-processed, the extended dataset was divided into data_train and data_test. The former was used to apply Machine Learning algorithms and obtain the prediction model, which was then validated with data_test. Two supervised learning algorithms were used for training, Logistic Regression and Random Forest (RF), in order to evaluate the prediction accuracy of each.
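A minimal sketch of such a split is shown below, assuming a random shuffle and the 70/30 proportion reported later in this section. The helper name, the fixed seed and the placeholder records are hypothetical; the study does not state which tool or random seed was used.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and return (data_train, data_test)."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible; the value 0 is arbitrary.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 8000 rows of the extended dataset.
records = list(range(8000))
data_train, data_test = split_train_test(records)
```

With 8000 records and a test fraction of 0.3, this yields 5600 training records and 2400 test records, and no record appears in both sets.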

Logistic regression is a statistical-inferential machine learning technique that dates back to the 1960s and is still used in current scientific research. It is considered an extension of linear regression models; the difference is that it has a categorical dependent variable, which can be binomial (0, 1) or multinomial [32]. For this research, the dataset was pre-processed so that the categorical variable (y) is binomial. In the logistic regression analysis, we assume that y = 1 when the sensor sends an extreme temperature and y = 0 when the measured temperature (‘TargetTemp’) is within the normal range. Considering the above, the probability that the HVAC system presents a failure, indicated by extreme temperatures, is described by Eqs. 1 and 2:

$$ P\left( {y\, = \,0} \right)\, = \,1\, - \,P\left( {y\, = \,1} \right) $$
(1)
$$ Y\, = \,f(B_{0} \, + \,B_{1} X_{1 } \, + \,B_{2} X_{2} \, + \, \ldots \, + \, B_{n} X_{n} )\, + \,{\text{u}} $$
(2)

Where u is the error term and \( f \) the logistic function:

$$ f\left( z \right)\, = \,\frac{{e^{z} }}{{1\, + \,e^{z} }} $$
(3)

So that:

$$ E\left[ Y \right]\, = \,P\, = \,P\left( {Y\, = \,1} \right)\, = \,\frac{{e^{{B_{0} \, + \,B_{1} X_{1 } \, + \,B_{2} X_{2} \, + \, \ldots \, + \,B_{n} X_{n} }} }}{{1\, + \,e^{{B_{0} \, + \,B_{1} X_{1 } \, + \,B_{2} X_{2} \, + \, \ldots \, + \, B_{n} X_{n} }} }} $$
(4)
$$ \ln\left( {\frac{P}{1\, - \,P}} \right)\, = \,B_{0} \, + \,B_{1} X_{1 } \, + \,B_{2} X_{2} \, + \, \ldots \, + \,B_{n} X_{n} $$
(5)

Where the set of independent variables is given by \( X_{1}, X_{2}, \ldots, X_{n} \), and n is the total number of independent variables. To predict the probability (P), we use the logit function of the binary logistic regression model represented in Eq. 5. By means of the logistic regression model, an accuracy of 0.651375 can be obtained in the prediction of values. For this dataset, the SystemAge column was taken as the feature and the FilterDifference column as the label; a total of 8000 records was used.
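The relationship between the logistic function of Eq. 3 and the logit of Eq. 5 can be checked numerically with a short sketch. The coefficient values below are invented purely for illustration and are not the coefficients fitted in this study; only one independent variable (the system age) is assumed.

```python
import math


def logistic(z):
    """Logistic function of Eq. 3: f(z) = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Log-odds of Eq. 5: ln(P / (1 - P)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


def predict_probability(b0, b1, x1):
    """P(Y = 1) from Eq. 4 for a single independent variable X1."""
    return logistic(b0 + b1 * x1)


# Hypothetical coefficients, NOT estimated from the HVAC dataset.
b0, b1 = -1.0, 0.05
p_old = predict_probability(b0, b1, 30)  # system age of 30 years
p_new = predict_probability(b0, b1, 0)   # system age of 0 years
```

Applying `logit` to `p_old` recovers the linear predictor b0 + b1 * 30 = 0.5, which is the content of Eq. 5, and `1 - p_old` gives P(y = 0) as in Eq. 1.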

The accuracy obtained for this prediction is 0.65375. The sizes of X_train and X_test were 5600 and 2400 records, respectively. After applying the logistic regression model and obtaining its accuracy, the random forest classification algorithm was applied to the dataset. In this regard, various authors [26, 31] confirm that the random forest classifier is an effective tool for prediction.

This classifier is also considered a nonparametric statistical method that can address regression problems and classification problems with two or more classes. The recent research of Scornet et al. [50], cited by [26], demonstrates the consistency of RF, and the number of its performance parameters is very low.

In [4] random forest is defined as follows: “… a random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1,…} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x…”.
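The voting step in this definition can be illustrated with a small sketch. The three "trees" below are hand-written threshold rules on the system age, used as hypothetical stand-ins; in an actual random forest each tree would be learned from a random sample of the training data, which this sketch does not attempt.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most unit votes from the trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a system age to 0 or 1.
trees = [
    lambda age: 1 if age > 15 else 0,
    lambda age: 1 if age > 20 else 0,
    lambda age: 1 if age > 10 else 0,
]

prediction_18 = forest_predict(trees, 18)  # votes: 1, 0, 1
prediction_12 = forest_predict(trees, 12)  # votes: 0, 0, 1
```

Using an odd number of trees for a two-class problem avoids ties in the vote.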

The predictions generated show an accuracy of 0.6425, with 5600 records used as training data and 2400 as test data. The difference in accuracy between the two models is 0.0125, in favor of logistic regression, when 70% of the data was used for training and 30% for testing with each machine learning algorithm. Where a failure (1) was predicted, malfunctions tended to occur in equipment that was between 15 and 30 years old.
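For reference, accuracy as used here is the fraction of test records whose predicted label matches the true label. The sketch below computes it on a small invented example; the label lists are hypothetical and unrelated to the HVAC dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Invented labels for illustration: 6 of the 8 predictions are correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy(y_true, y_pred)
```

Comparing two classifiers on the same test set then reduces to comparing these fractions.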

5 Conclusions and Future Work

The proposed prediction model is still in an early stage of development. This leaves room for the implementation of other machine learning techniques and for the use of larger datasets obtained from sensor networks installed in other environments. The results for the dataset used in this case study show that the accuracy of the logistic regression model is similar to that of random forest in predicting malfunction in the HVAC system.

The modeling and integration of the large volumes of industrial data generated by machines and collected by sensors is a clear problem that still needs to be addressed in future research. Future work will therefore include testing other machine learning methods for classification, training and prediction. These tests will provide the grounds for the development of algorithms that generate predictive models adapted to organizations in the context of Industry 4.0.