1 Introduction

Offering quality of service at minimal cost and without interruption is a goal of electric utilities. For this to happen, it is necessary that power transformers, important and expensive pieces of equipment, operate without faults or defects that interrupt the power supply, because their replacement is an expensive and lengthy process (Marques et al. 2014).

Several studies in the literature emphasize the importance of developing predictive techniques to identify defects and incipient faults in power transformers. In Marques et al. (2014), 33 years of equipment fault monitoring are presented, and the importance of implementing fault detection methods is highlighted. In Wang et al. (2002), fault statistics for power transformers are presented, and diagnostic methods, such as measuring the quality of the oil, are described. In Júnior et al. (2015), a review of defects and faults in power transformers shows how existing prediction methods can help identify them before a catastrophic failure occurs.

The transformer insulation system, particularly the cellulosic material, is the component that degrades the most and, therefore, defines the useful life of the equipment. In order to avoid service unavailability and to reduce the amount of corrective maintenance, predictive techniques have been developed to detect incipient faults resulting from partial discharges (PD), which are caused by the decrease in the dielectric strength of the insulation system. Among the techniques currently in use, acoustic emission (AE) measurement stands out as it is noninvasive and enables the localization of PDs (Mohammadi et al. 2006), allowing the identification of the defect and the monitoring of the equipment while it is operating.

Table 1 Description of the data collected by the sensors

For a complete and efficient analysis of the condition of the equipment, computational classification tools can be used in conjunction with predictive techniques to create expert systems that aid the decision-making process of a human expert. Many references in the literature address the problem of pattern classification of power transformers using measured data. Among them, Tang et al. (2008) and Hao and Cai-xin (2007) work with Dissolved Gas Analysis (DGA), using Bayesian Networks (BN) and Neural Networks (NN), respectively. In Kuo (2009), a Multilayer Perceptron (MLP) neural network is trained with acoustic emission data to identify defects by classifying features extracted from real-time AE measurements.

This paper presents an approach using AE measurement to estimate the health of the power transformer, in which a Bayesian Network, combined with a Compact Genetic Algorithm (Harik et al. 1999) adapted for solving mixed integer programming problems (MIPcGA), is trained to classify the state of the equipment. This approach allows the qualitative health monitoring of the equipment while in operation. Therefore, if there are signs of incipient faults, preventive maintenance procedures can be scheduled, which, by their nature, imply lower expenses, lower risks, and less wear on the relationship with customers, especially through the possibility of notifying consumers in advance when the power supply needs to be temporarily suspended. The aim of the method proposed in this paper is to support the maintenance staff in the decision on whether or not to intervene in the equipment.

This approach provides some advantages over the previous ones: (i) compared with an NN, it allows variables to be correlated and works with degrees of belief (probabilities); (ii) it allows the calculation of posterior probabilities, which can provide classifications using only a subset of nodes; and (iii) owing to the properties in items (i) and (ii), alternative results are not ruled out outright, since each classification is reported with its degree of belief.

2 Incipient Failures and Partial Discharges

According to Butler-Purry and Bagriyanik (2003), incipient faults are those that develop slowly through gradual deterioration of the insulating system. Because they are not observable at all instants of time, they are difficult to detect. Early detection is therefore of great importance because, over time, as the condition of the equipment degrades due to electrical, chemical, and thermal effects, incipient faults begin to persist and intensify in the system (Butler-Purry and Bagriyanik 2003). To that end, important techniques emerged for detecting PDs, which, according to Mohammadi et al. (2006) and Azevedo et al. (2009), are small sparks resulting from the degradation of the transformer's insulating system and which indicate a possible incipient failure.

Among the various techniques to detect PDs, AE measurement was chosen in this paper, as it is a noninvasive method that allows the discharges to be localized by monitoring the equipment with ultrasonic transducers (IEEE 2007), which, positioned strategically and securely on its outer surface, detect acoustic waves coming from inside the transformer.

When comparing existing predictive techniques, the dissolved gas analysis (DGA) of the insulating oil, although an established technique in the electricity sector, presents some weaknesses, mainly its low sensitivity for detecting partial discharges (IEEE 2007; Azevedo et al. 2009), which prevents the detection of incipient faults inside the transformer. In this context, the detection of partial discharges (PDs) by the acoustic emission (AE) method is efficient and also indicates the coordinates of the region where PD activity is occurring (Azevedo et al. 2009). As a good practice, however, it may be used in conjunction with the DGA method to complement the analyst's decision-making process.

The acquisition of the waveform of the acoustic signal is optional in the equipment. Because of the length of each acoustic measurement, which in this work was 24 h (to cover the complete load cycle of the transformer), storing the acoustic signal becomes unfeasible: roughly, a total of \(14 \text { sensors} \cdot 12 \,\text {bits} \cdot 10\,\text {Msps} \cdot 24\,\text {h} \cdot 60\,\text {min} \cdot 60\,\text {s} \approx 20.5\,\text { TBytes}\) would be needed per equipment, per measurement. Therefore, it was decided to use data derived from the waveform, namely the parameters in Table 1. For each acoustic activity that occurs, only these parameters, not the waveform, are stored in a file, which significantly reduces the amount of stored data.
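The back-of-envelope storage estimate can be reproduced as follows (a sketch only; the exact figure depends on how the 12-bit samples are packed, and tight packing at 1.5 bytes per sample is assumed here):

```python
SENSORS = 14
BITS_PER_SAMPLE = 12
SAMPLE_RATE = 10_000_000   # 10 Msps
SECONDS = 24 * 60 * 60     # one full 24 h load cycle

total_bits = SENSORS * BITS_PER_SAMPLE * SAMPLE_RATE * SECONDS
total_tbytes = total_bits / 8 / 1e12
print(f"{total_tbytes:.1f} TB per equipment, per measurement")
```

Whatever the packing assumption, the order of magnitude (tens of terabytes) makes storing raw waveforms impractical, which motivates keeping only the per-activity parameters of Table 1.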

The most important parameters collected by the instrument that records the acoustic signals generated by PDs and captured by the sensors are presented in Table 1. A typical acoustic signal waveform is shown in Fig. 1.

Fig. 1
figure 1

Representation of the activity (occurrence of an acoustic signal waveform) captured by the sensors

With the data provided by the sensors in hand, a procedure is needed to distinguish acoustic signals caused by noise from signals caused by PDs. Empirically, it can be noted that the most intense partial discharges occur at the voltage peaks, both positive and negative. Two activities with a time difference of \(1/(2 \cdot 60\,\text {Hz})\) s, equivalent to an angular difference of \(180^{\circ }\), can thus be considered evidence of partial discharges. Note that this calculation must be redone for systems operating at a frequency other than \(60\,\text {Hz}\). In addition, the employed sensors have a passband of around 150 kHz, which, for field measurements, is the most suitable range for capturing acoustic signals from partial discharges. In Fig. 2, the results of a measurement are presented: signals that are entirely noise (in blue) and signals identified as evidence of PDs (in red), where count is the number of activities, mean is the average recorded energy, and %T is the percentage of minutes in which activities were detected.
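The pairing rule above can be sketched as a scan over activity timestamps (a hypothetical helper; timestamps are in seconds and the tolerance is chosen for illustration only):

```python
def pd_evidence(timestamps, mains_hz=60.0, tol=1e-4):
    """Flag pairs of acoustic activities separated by half a mains period
    (an angular difference of 180 degrees), taken as evidence of PDs."""
    half_period = 1.0 / (2.0 * mains_hz)  # 1/120 s for a 60 Hz system
    ts = sorted(timestamps)
    pairs = []
    for i in range(len(ts)):
        for j in range(i + 1, len(ts)):
            dt = ts[j] - ts[i]
            if dt > half_period + tol:
                break  # timestamps are sorted; later ones are even farther
            if abs(dt - half_period) <= tol:
                pairs.append((ts[i], ts[j]))
    return pairs
```

For a 50 Hz system the same function applies with `mains_hz=50.0`, matching the remark that the calculation must be redone for other frequencies.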

Fig. 2
figure 2

Variation in energy count as a function of monitoring time, for signals of: a total noise, without indication of PDs, and b with indication of PDs (Color figure online)

3 Bayesian Networks

Bayesian Networks are Directed Acyclic Graphs (DAGs) whose nodes have a finite set of states. For each node, there is a table of conditional probabilities relating each of its values to each possible value of its parents. It is an appropriate reasoning model for classifying defective equipment, as it allows patterns to be detected easily if a suitable data set is provided (Jensen and Nielsen 2007).

3.1 Bayes' Rule

As the foundation of Bayesian Networks, Bayes' theorem is a mathematical tool to calculate conditional probabilities. Its formula allows the probabilities of unobserved variables to be updated in light of observed variables. Letting P(x|e) be the probability of x given the evidence e, P(e|x) the likelihood, and P(x) and P(e) the prior probabilities of x and e, respectively, Bayes' theorem states that:

$$\begin{aligned} P(x|e) = \frac{P(e|x)P(x)}{P(e)} \end{aligned}$$
(1)
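A minimal numeric illustration of (1), with made-up numbers in the spirit of this application: suppose the prior probability of a defect is 0.1, PD evidence appears with probability 0.8 when a defect exists and 0.05 otherwise.

```python
p_x = 0.1               # prior P(x): equipment is defective
p_e_given_x = 0.8       # likelihood P(e|x): PD evidence given a defect
p_e_given_not_x = 0.05  # P(e|not x): PD evidence without a defect

# total probability of the evidence, P(e)
p_e = p_e_given_x * p_x + p_e_given_not_x * (1 - p_x)
# Bayes' theorem, Eq. (1)
p_x_given_e = p_e_given_x * p_x / p_e
print(round(p_x_given_e, 3))  # 0.64
```

Observing the evidence raises the belief in a defect from 10 to 64%, which is exactly the kind of belief update the classifier performs.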

3.2 Structure Learning

The structure of a BN may be learned in two ways: (i) from the prior knowledge of a specialist or (ii) with an automated structure learning tool. The latter is preferred here, as it is an unsupervised form of learning that allows the best structure for the selected data to be found at each training.

The simplest way to learn the structure of a BN would be to compare all combinations and find the one that fits best. For this problem, exhaustive search is impracticable, because the solution space grows super-exponentially as nodes are added (Rudolf Kruse 2013). Therefore, heuristic methods are used to find the best structure for the BN. This paper, accordingly, suggests the use of two structure learning methods, K2 (Algorithm 1) and Hill-Climbing (Algorithm 2), selecting the one with the best score.

K2 (Lerner and Malka 2011) is a greedy search method that uses its own metric (3) to evaluate the structure. Let \(N_{ijk}\) be the number of cases in the training database in which the variable \(u_i\) is instantiated with its kth value and its parents are instantiated with their jth value, \(\pi (u_i)\) the parents of \(u_i\), \(q_i\) the number of possible instantiations of \(\pi (u_i)\), and \(r_{u_i}\) the number of possible discrete values of \(u_i\):

$$\begin{aligned} N_{ij}&= \sum _{k=1}^{r_{u_i}}{N_{ijk}} \end{aligned}$$
(2)
$$\begin{aligned} g(i, \pi (u_i))&= \prod _{j=1}^{q_i}{\frac{(r_{u_i}-1)!}{(N_{ij} + r_{u_i}-1)!}} \prod _{k=1}^{r_{u_i}}{N_{ijk}!} \end{aligned}$$
(3)

where \(g(i, \pi (u_i))\) is the score of the training database given that the parents of \(u_i\) are \(\pi (u_i)\).
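In practice, (3) is usually evaluated in log space with the log-gamma function to avoid overflow, using \((r-1)! = \Gamma (r)\). A sketch, taking the counts \(N_{ijk}\) as a nested list indexed by \(j\) then \(k\):

```python
from math import lgamma

def log_k2_score(counts):
    """log of g(i, pi(u_i)) in Eq. (3); counts[j][k] holds N_ijk for the
    j-th parent configuration and k-th value of u_i."""
    score = 0.0
    for row in counts:
        r = len(row)     # r_{u_i}: number of discrete values of u_i
        n_ij = sum(row)  # Eq. (2)
        # log of (r_{u_i}-1)! / (N_ij + r_{u_i} - 1)!
        score += lgamma(r) - lgamma(n_ij + r)
        # log of the product of N_ijk!
        score += sum(lgamma(n + 1) for n in row)
    return score
```

For example, with a single parent configuration and counts \(N_{i11}=2\), \(N_{i12}=1\), Eq. (3) gives \(1!/4! \cdot 2! \cdot 1! = 1/12\), and the function returns its logarithm.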

figure a

What makes K2 attractive is also one of its drawbacks: running the algorithm requires prior knowledge of the correct ordering of the nodes, which is not always available. Therefore, the use of the Hill-Climbing algorithm (Jensen and Nielsen 2007) with the BIC metric (4) is additionally suggested, the best structure being chosen between the two. Let n be the number of nodes, N the number of cases in the training base, D the database of known data, and G the DAG to be scored; the BIC score is given by:

$$\begin{aligned} \mathrm{BIC}(G,D)&= \sum _{i=1}^n \sum _{j=1}^{q_i} \sum _{k=1}^{r_{u_i}} N_{ijk} \cdot \log _2 \left( \frac{N_{ijk}}{N_{ij}} \right) \nonumber \\&\qquad - \frac{\log _2 N}{2} \sum _{i=1}^n q_i \left( r_{u_i} - 1 \right) \end{aligned}$$
(4)

The BIC metric has the advantage of being score-equivalent (Darwiche 2009), i.e., equivalent structures receive the same score, and decomposable, so that when a new structure is generated, only the difference between the new and old structures needs to be computed, without recalculating the score of the entire structure. The valid operations are adding an arc, removing an arc, and inverting the direction of an arc, always keeping the structure acyclic.
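A direct implementation of (4) from the same counts as before (a sketch; `counts[i][j][k]` holds \(N_{ijk}\) for node \(i\), parent configuration \(j\), and node value \(k\)):

```python
from math import log2

def bic_score(counts, n_cases):
    """BIC(G, D) of Eq. (4); n_cases is N, the number of training cases."""
    score = 0.0
    penalty_terms = 0
    for node in counts:
        q_i = len(node)      # number of parent configurations
        r_i = len(node[0])   # number of values of the node
        penalty_terms += q_i * (r_i - 1)
        for row in node:
            n_ij = sum(row)
            for n_ijk in row:
                if n_ijk > 0:  # 0 * log2(0) is taken as 0
                    score += n_ijk * log2(n_ijk / n_ij)
    return score - log2(n_cases) / 2 * penalty_terms
```

Because the score is a sum over nodes, changing the parents of one node only requires recomputing that node's term, which is what makes hill-climbing cheap.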

figure b

A third way of constructing a structure is the naive classifier, in which all feature nodes share the same single parent: the class node.

In this study, two structures were generated, one with each method described (K2 and Hill-Climbing), and the one with the best score according to (4) was selected. A naive classifier was also employed.

3.3 Parameters Learning

Maximum Likelihood Estimation (MLE) and Expectation Maximization (EM) are methods for learning parameters. The former operates on complete data, while the latter deals with incomplete data. As the sensors always provide all data related to the activity inside the equipment, the MLE method is recommended; according to Darwiche (2009), it reduces to a counting problem of the form \(P(X|Y) = N(X, Y)/N(Y)\), where N(X, Y) is the number of cases in which X and Y occur together and N(Y) is the number of cases in which Y occurs.

3.4 Inference Engine

Several inference engines for Bayesian Networks are available in the literature (Jensen and Nielsen 2007; Darwiche 2009). This work uses the variable elimination method, capable of answering exact queries (the calculated probability is exact, not approximate) for prior probabilities, posterior probabilities, the most probable explanation (MPE), and the maximum a posteriori probability (MAP), through the successive elimination of variables.
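A single elimination step can be illustrated on a two-node chain: summing a defect variable out of \(P(X)\,P(Y|X)\) leaves the marginal of the evidence variable. This is a toy sketch with made-up numbers, not the full engine:

```python
def eliminate(p_x, p_y_given_x):
    """Sum X out of the product P(X) * P(Y|X), leaving the factor P(Y)."""
    states_y = next(iter(p_y_given_x.values())).keys()
    return {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x)
            for y in states_y}

p_defect = {"yes": 0.1, "no": 0.9}
p_signal_given_defect = {"yes": {"pd": 0.8, "quiet": 0.2},
                         "no":  {"pd": 0.05, "quiet": 0.95}}
p_signal = eliminate(p_defect, p_signal_given_defect)
print(round(p_signal["pd"], 3))  # 0.125
```

Full variable elimination repeats this multiply-and-sum-out step over an ordering of the non-query variables, which is what keeps the answers exact.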

3.5 Discretization of Continuous Values

The parameterization of the BN requires knowledge of the conditional probability distributions of nodes given their parent nodes, which, in the case of continuous variables, would not be a counting problem. Some papers address this problem: Driver and Morrell (1995) use a weighted sum of Gaussians to approximate the prior (root node) and conditional probabilities, and in Darwiche (2009) and Lucas and Hommersom (2015), the continuous variables are assumed to follow a Gaussian distribution.

Therefore, in this work, only discrete values are used. To establish the intervals of the parameters properly and automatically, a discretization method for continuous variables is proposed that assigns discrete values, with lower and upper boundaries, to continuous intervals, using a Compact Genetic Algorithm adapted for solving mixed integer programming problems, the MIPcGA.

4 Compact Genetic Algorithm for Solving Mixed Integer Programming Problems

In a traditional Genetic Algorithm (GA), individuals compete against each other, and allele frequencies can increase or decrease depending on the good or bad decisions made by the GA (Harik et al. 1999). In a GA, genes are evaluated in the context of individuals as a whole; in the cGA, selection works differently. Consider two solutions, a and b, competing with each other on a OneMax problem, where the objective is to maximize the number of ones in a binary string (Prugel-Bennett et al. 2015).

Table 2 Fitness and binary representation of each solution

As presented in Table 2, the solution a wins the competition by having greater fitness. The problem with this approach arises at position 4 of the chromosome, where an error occurs: bit 0 is selected instead of bit 1. Therefore, Harik et al. (1999) propose a scheme similar to tournament selection: at each iteration, the proportion of the winner's genes grows by 1/n, where n is the population size, benefiting the best genes in each tournament.
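This update scheme can be sketched on OneMax itself (a minimal sketch of the cGA of Harik et al. (1999), not the full MIPcGA; parameters are illustrative):

```python
import random

def cga_onemax(n_bits=16, pop_size=100, max_iters=20000, seed=7):
    """Compact GA: the population is represented only by a probability
    vector p. Two individuals are sampled from p; on genes where they
    differ, p moves toward the winner by 1/pop_size."""
    rng = random.Random(seed)
    p = [0.5] * n_bits
    sample = lambda: [1 if rng.random() < q else 0 for q in p]
    for _ in range(max_iters):
        a, b = sample(), sample()
        winner, loser = (a, b) if sum(a) >= sum(b) else (b, a)
        step = 1.0 / pop_size
        for i in range(n_bits):
            if winner[i] != loser[i]:
                p[i] = min(1.0, max(0.0, p[i] + (step if winner[i] else -step)))
        if all(q in (0.0, 1.0) for q in p):
            break  # probability vector has converged
    return [round(q) for q in p]
```

Because only the probability vector is stored, the memory footprint is that of a single chromosome, which is the main appeal of the compact GA.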

By using an appropriate procedure for codification of integer and real variables into bits and a procedure based on Harik et al. (1999) for constraint handling, this paper proposes the use of the cGA tailored for solving mixed integer programming problems, the MIPcGA, for the discretization of continuous variables used in the BN classification process.

The MIPcGA is presented in Algorithm 3. Between two solutions, the algorithm selects the candidate with the smaller constraint violation or, if the violations are equal, the one with the better objective function value. This procedure is similar to Deb (2007). Therefore, individual solutions are compared using the following rules:

1. A solution that violates a smaller quantity of constraints is always preferred over a solution that violates more;

2. Among two feasible solutions, the one having better fitness is preferred;

3. Among two solutions that violate the same quantity of constraints, the one having the smaller constraint violation is preferred.
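These rules can be expressed as a single comparison function (a sketch; each solution is assumed to carry its fitness, the number of violated constraints, and the total violation magnitude):

```python
def better(s1, s2):
    """Return True if s1 is preferred over s2 under the three rules above.
    A solution is a tuple (fitness, n_violated, violation_amount)."""
    f1, n1, v1 = s1
    f2, n2, v2 = s2
    if n1 != n2:
        return n1 < n2  # rule 1: fewer violated constraints wins
    if n1 == 0:
        return f1 > f2  # rule 2: both feasible -> higher fitness wins
    return v1 < v2      # rule 3: same count -> smaller violation wins
```

Note that fitness is only consulted between feasible solutions, so the search is first pulled toward the feasible region and only then optimized.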

Considering the algorithms proposed, the problem of discretization of continuous parameters of the BN is defined as:

$$\begin{aligned} \text {Maximize }&f(X), \\ \text {Subject to }&l_i \le x_{ij} \le u_i, \\&a_{ij} \in \{0, 1\}, \\&x_{ij} \in \mathbb {Z}. \end{aligned}$$

where f(X) is the classification rate of the BN, \(a_{ij}\) represents the bit to enable or to disable the jth boundary of the ith continuous parameter \(x_{ij}\), \(l_i\) and \(u_i\) are the minimum and maximum values of this parameter, and X is the set containing all \(x_{ij}\).

Using the MIPcGA for the discretization of a continuous variable implies solving a problem with 2n variables, where n is the maximum number of intervals. An example of a valid solution is shown in Table 3 and its discrete representation in Table 4, where 0 is the lowest and 1756 the highest possible value of the variable. If the integer variable is 0, the real variable is not used to compose the boundaries (it is disabled); if it is 1, the real variable composes the boundaries.
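Decoding a candidate solution into interval boundaries follows directly (a sketch with hypothetical names and example values in the spirit of Tables 3 and 4; enabled boundaries are sorted and a continuous value is then mapped to a discrete interval index):

```python
from bisect import bisect_right

def decode_intervals(bits, boundaries, low, high):
    """Keep only the boundaries whose enable bit is 1, inside (low, high)."""
    inner = sorted(b for a, b in zip(bits, boundaries)
                   if a == 1 and low < b < high)
    return [low] + inner + [high]

def discretize(value, cut_points):
    """Map a continuous value to the index of its interval."""
    return bisect_right(cut_points[1:-1], value)

cuts = decode_intervals([1, 0, 1], [300, 900, 1200], low=0, high=1756)
print(cuts)                   # [0, 300, 1200, 1756]
print(discretize(500, cuts))  # 1
```

Disabling a bit simply merges two adjacent intervals, so the optimizer can vary both the number and the position of the cut points with one chromosome.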

To apply the discretization method, the following steps are applied:

1. Learn a BN using an initial solution;

2. Apply the MIPcGA algorithm to enable/disable the boundaries of the continuous parameters, selecting genes that maximize the classification rate of the BN. At each generation of the algorithm, new BNs are learned with the selected boundaries and compared.

Table 3 Example of valid values of boundaries
Table 4 Example of discrete values
figure c

To evaluate the solutions, in this paper, the steps presented in Algorithm 4 were used to calculate the objective function.

figure d

5 Power Transformers Classification Using Acoustic Emission Measurements Data

In this study, a method is proposed for the qualitative health monitoring of power transformers using the procedures described and the measurements of the acoustic sensors.

In order to assist the decision-making process, classifications (from "A" to "D") were used, with their respective recommended actions, as presented in Table 5 (Marques et al. 2016). The parameters from Table 6, extracted from the measurement data (Table 1) as a result of the studies described in Marques et al. (2016), were used in the construction of the BN.

Table 5 Possible classifications
Table 6 Information used for classification

6 Results and Discussion

To evaluate the method, 40 measurements performed on power transformers of the company Enel Distribuição Goiás were used, with power ratings ranging from 32.0 to 75.0 MVA and diversified structural designs. The data were collected during AE measurements performed in the field, with the transformer in operation, over 24 h, i.e., a complete load cycle, to ensure that all operating conditions were evaluated.

Fourteen sensors from a DISP device of Physical Acoustics Corporation were used, with a sampling rate of 10 Msps and operating around 150 kHz, positioned at strategic parts of the equipment; this requires full knowledge of the geometry of the active part, tank, and bushings of the power transformer. Each sensor of each measurement was classified, resulting in 560 cases composed of the parameters presented in Table 6. The approach presented in Sect. 2 was used to distinguish between noise signals and partial discharges detected by the sensors. The optimization algorithm seeks to generate a classification method capable of reproducing the results given by the specialists who pre-classified the available data set. Therefore, 392 samples were used for training the BNs and 168 samples for validation (a 70-30% split), including all possible classifications, from "A" to "D." Algorithm 4 was applied 30 times, both for structure learning (Sect. 3.2) and for the naive classifier (BN-Naive), and for each BN, the performance was verified 100 times, selecting the learning and validation groups at random. The base-10 logarithm was applied to the continuous values, as it reduces the scale of the search space of the MIPcGA optimizer.

Using structure learning, among the 30 BNs created, an average correct classification rate of 79.94% was obtained, with a maximum average of 85.67% and a minimum of 72.25%. Figure 3 shows all executions of the algorithm. The BN-Naive obtained an average correct classification rate of 81.82%, with a maximum average of 88% and a minimum of 77.33%. Figure 4 shows all executions of the algorithm using a naive classifier.

For comparison purposes, a Multilayer Perceptron (MLP) neural network trained with the Levenberg–Marquardt algorithm (Mohamad et al. 2010) and a Decision Tree (DTree) created with the CART algorithm (Kuhn and Johnson 2013) were used. Different combinations of neurons and layers were tested for the MLP, and it was found that, for this specific problem, one hidden layer with five neurons was the best configuration. The classification process was run 100 times, with the measurement data split into learning and validation groups of 70 and 30% of the data, respectively. The mean correct classification rate of the MLP was 73.67%, with a minimum of 8.33% and a maximum of 100%. The mean correct classification rate of the Decision Tree, illustrated in Fig. 5, was 77.17%, with a minimum of 33.33% and a maximum of 100%. Applying the base-10 logarithm to the continuous values showed no improvement in the results of the DTree or the MLP.

Fig. 3
figure 3

Evaluation of the algorithm using structure learning

Fig. 4
figure 4

Evaluation of the algorithm using a naive classifier

Table 7 shows a comparison of the correct classification rates of the algorithms. To choose the best BN, the intervals that led to the best average rate of correct classifications were used.

Table 7 Comparison of the classification algorithms
Fig. 5
figure 5

Example of a decision tree used for classification

Figure 6a, b shows the structures of a naive classifier and of a BN obtained with the structure learning algorithm, chosen at random. Tables 8 and 9 show the boundaries found by the MIPcGA.

Fig. 6
figure 6

BNs’ structures: a Naive classifier, b structure learned by the algorithm

Table 8 Discrete values of one of the 30 naive classifiers chosen at random
Table 9 Discrete values of one of the 30 BN using structure learning chosen at random

7 Conclusion

This paper presented a method capable of learning a suitable Bayesian Network for classifying the health condition of power transformers from acoustic emission measurement data. The network indicates not only the classification of the equipment but also the degree of belief in the result, given the evidence. A solver based on the Compact Genetic Algorithm for mixed integer optimization problems with constraints was proposed for the discretization of the continuous values that feed the network. The proposed method involves learning a BN and discretizing with the MIPcGA. Running a metaheuristic algorithm can be time-consuming; however, the algorithm only needs to be run when the BN needs to be relearned. After the BN is learned, quick classifications can therefore be performed without running the solver.

As comparison criteria, Multilayer Perceptron neural networks and Decision Trees, both suitable methods for classification problems, were created, and it was found that the BN achieved superior performance. The proposed method presented better results than the MLP and the DTree both in the worst case and on average; in the best scenario, it showed superior results to the MLP and the same result as the DTree.

The results indicated the BN-Naive as the most suitable for the classification process, followed by the BN with structure learning. It is noteworthy that the method is suitable for other classification problems, as it can use both discrete and continuous parameters and can employ structures indicated by an expert, a BN-Naive, or a BN with structure learning, provided that an adequate learning data set is available.