1 Introduction

The power transformer plays a vital role in power systems: it connects transmission and distribution networks at different voltage levels [1]. A transformer failure causes power interruptions that seriously affect economic production and daily life [2]. It is therefore necessary to detect and diagnose incipient faults in power transformers [3].

In recent years, the diagnosis of incipient faults in power transformers has been studied by many scholars, and the problem has been addressed with different approaches. The aim of these studies is to meet the reliability requirements of power supply, so as to minimize device damage and reduce economic, social and personal losses. The approaches include frequency response analysis [4], vibration analysis [5] and dissolved gas analysis (DGA) [6].

Among these methods, DGA is widely used to detect incipient faults in power transformers [7, 8], and a large number of DGA-based methods have been put forward. Traditional DGA-based methods such as IEEE key gases [9], Rogers ratios [10], the IEC standard code [11], Dornenburg ratios [12], the Duval triangle [13] and IEC 60599 [14] have been applied to transformer fault diagnosis.

In addition to these traditional methods, DGA-based artificial intelligence (AI) methods have recently been applied to transformer fault diagnosis, including artificial neural networks (ANN) [15], support vector machines (SVM) [16], fuzzy logic (FL) [17], adaptive neuro-fuzzy inference systems (ANFIS) [18], random forests (RF) [19], Bayesian networks [20], gene expression programming (GEP) [21], time series analysis [22], expert systems [23], association rules [24], set pair analysis [25] and evidential reasoning [26]. These methods have produced good results and made important contributions to the field. They tend to build models that map machine data to machine health states [27].

Deep learning has also been proposed for transformer fault diagnosis as computing capacity has grown [28]. Compared with traditional AI methods, deep learning realizes end-to-end learning, avoids tedious feature engineering and builds a black-box model. However, it requires many iterations and weight updates with time-consuming gradient computation. To address the efficiency loss caused by gradient updates, a model based on the extreme learning machine (ELM) was introduced to identify the state of the transformer [29]. The ELM model outperformed ANN and other traditional methods. However, the stochastic selection of the number of hidden layer nodes easily leads to low stability and poor robustness of the model in application [30].

Although the reviewed articles present a variety of diagnosis models, a more precise diagnosis model is still needed. For this purpose, this study proposes a novel fault diagnosis model for power transformers based on a kernel extreme learning machine (KELM) whose parameters are tuned by Harris hawks optimization (HHO), referred to as HHO-KELM.

The contributions of this study are as follows: (1) A set of Non-code ratios is selected as the characterizing vector for the KELM model. (2) A parameter optimization method based on HHO is adopted to obtain the optimal KELM parameters automatically. (3) The established HHO-KELM model is applied to diagnose incipient faults in power transformers.

The remaining parts of the paper are organized as follows: Sect. 2 describes the proposed method; Sect. 3 presents the proposed fault diagnosis approach based on HHO-KELM; Sect. 4 gives the results and their analysis; Sect. 5 concludes the paper.

2 The Proposed Method

A. Kernel Extreme Learning Machine

Kernel extreme learning machine (KELM) [30] is a machine learning algorithm with strong learning and generalization ability, built on ELM and kernel functions. Starting from ELM, the learning method of KELM is described as follows:

Given N distinct samples \((x_{j} ,t_{j} ) \in R^{n} \times R^{m}\), \(j = 1, \ldots ,N\), where \(x_{j} = [x_{j1} ,x_{j2} , \ldots ,x_{jn} ]^{T}\) is an input sample and \(t_{j} = [t_{j1} ,t_{j2} , \ldots ,t_{jm} ]^{T}\) is its encoded class label, the mathematical model of ELM can be described as:

$$\sum\limits_{i = 1}^{L} {\beta_{i} g(x_{j} ,w_{i} ,b_{i} ) = } t_{j} ,\quad \, j = 1, \ldots ,N$$
(1)

where L is the number of neurons in the hidden layer, \(w_{i}\), \(b_{i}\) and \(\beta_{i}\) are the input weights, hidden layer bias and output weights of the ith hidden neuron, respectively, and \(g(x_{j} ,w_{i} ,b_{i} )\) is the activation function. Equation (1) can be written in matrix form as:

$${\mathbf{HB}} = {\mathbf{T}}$$
(2)

where \({\mathbf{B}} = [\beta_{1} , \ldots ,\beta_{L} ]^{T}\) is the matrix of output layer weights, \({\mathbf{T}} = [t_{1} , \ldots ,t_{N} ]^{T}\) is the matrix of encoded class labels, and H is the hidden layer output matrix:

$${\mathbf{H}} = \left[ {\begin{array}{*{20}c} {h(x_{1} )} \\ \vdots \\ {h(x_{N} )} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {g(x_{1} ,w_{1} ,b_{1} )} & \cdots & {g(x_{1} ,w_{L} ,b_{L} )} \\ \vdots & \ddots & \vdots \\ {g(x_{N} ,w_{1} ,b_{1} )} & \cdots & {g(x_{N} ,w_{L} ,b_{L} )} \\ \end{array} } \right]$$
(3)

Training the ELM is equivalent to finding the least-squares solution of the linear system in Eq. (2), which is defined as:

$$\left\| {{\mathbf{H}}(w_{1} , \ldots ,w_{L} ,b_{1} , \ldots ,b_{L} ){\hat{\mathbf{B}}} - {\mathbf{T}}} \right\| = \mathop {\min }\limits_{{\beta_{i} }} \left\| {{\mathbf{H}}(w_{1} , \ldots ,w_{L} ,b_{1} , \ldots ,b_{L} ){\mathbf{B}} - {\mathbf{T}}} \right\|$$
(4)

The regularized least-squares solution of the output weights is calculated as:

$${\hat{\mathbf{B}}} = {\mathbf{H}}^{T} (I/C + {\mathbf{HH}}^{T} )^{ - 1} {\mathbf{T}}$$
(5)

where C is the regularization parameter. The output function for ELM is implemented as:

$$f(x) = h(x){\hat{\mathbf{B}}} = h(x){\mathbf{H}}^{T} (I/C + {\mathbf{HH}}^{T} )^{ - 1} {\mathbf{T}}$$
(6)
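As an illustration, Eqs. (3), (5) and (6) can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used in this study: the sigmoid activation, the uniform initialization range, the function names `elm_fit` and `elm_predict`, and the default values of `L` and `C` are all assumptions made for the example.

```python
import numpy as np

def elm_fit(X, T, L=20, C=1.0, seed=0):
    """Fit a single-hidden-layer ELM. X: (N, n) inputs, T: (N, m) encoded labels."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # random input weights w_i, never trained
    b = rng.uniform(-1.0, 1.0, size=L)                # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden layer output matrix, Eq. (3)
    N = X.shape[0]
    # Eq. (5): B_hat = H^T (I/C + H H^T)^(-1) T, computed with a linear solve
    B_hat = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, B_hat

def elm_predict(X, W, b, B_hat):
    """Eq. (6): f(x) = h(x) B_hat for every row of X."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ B_hat
```

Only the output weights are solved for; the input weights and biases stay at their random initial values, which is why no gradient iterations are needed.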

When the hidden layer function h(x) is unknown, the kernel function matrix is calculated as:

$$\left\{ {\begin{array}{*{20}l} {\Omega_{{{\text{ELM}}}} = {\mathbf{HH}}^{T} } \hfill \\ {\Omega_{ij} = h(x_{i} ) \cdot h(x_{j} ) = K(x_{i} ,x_{j} )} \hfill \\ \end{array} } \right.$$
(7)

where \(K(x_{i} ,x_{j} )\) represents the kernel function. The radial basis function (RBF) kernel is used in this paper, which is given by:

$$K(x_{i} ,x_{j} ) = \exp \left( { - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{\sigma }} \right)$$
(8)

Then, the output function of KELM can be obtained as:

$$f(x) = h(x){\mathbf{H}}^{T} (I/C + {\mathbf{HH}}^{T} )^{ - 1} {\mathbf{T}} = \left[ {\begin{array}{*{20}c} {K(x,x_{1} )} \\ \vdots \\ {K(x,x_{N} )} \\ \end{array} } \right]^{T} (I/C + \Omega_{{{\text{ELM}}}} )^{ - 1} {\mathbf{T}}$$
(9)
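Equations (7) to (9) translate almost directly into code. The following is a minimal sketch under stated assumptions rather than the implementation used in this study: the function names `rbf_kernel`, `kelm_fit` and `kelm_predict` are hypothetical, and the kernel divides by σ exactly as written in Eq. (8).

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Eq. (8): K(a, b) = exp(-||a - b||^2 / sigma) for every pair of rows."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)  # clamp tiny negative round-off

def kelm_fit(X, T, C, sigma):
    """Solve (I/C + Omega_ELM) alpha = T, the matrix factor on the right of Eq. (9)."""
    omega = rbf_kernel(X, X, sigma)               # Eq. (7)
    return np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, sigma):
    """Eq. (9): f(x) = [K(x, x_1), ..., K(x, x_N)] alpha."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha
```

Because the hidden layer is replaced by the kernel, the only free parameters left are C and σ, and no number of hidden nodes has to be chosen.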

B. Harris Hawks Optimization Algorithm

Harris hawks optimization (HHO) is a metaheuristic optimization technique introduced by Heidari in 2019 [31]. It consists of an exploration phase and an exploitation phase.

1. Exploration Phase

In this stage, the Harris hawks search for the prey using two strategies. In the first strategy, the hawks perch close to the other team members and the prey (the K < 0.5 case in Eq. (10)). In the second strategy, the hawks perch on random trees (the K ≥ 0.5 case). These two strategies can be described as:

$$X(t + 1) = \left\{ {\begin{array}{*{20}l} {X_{rand} (t) - r_{1} \left| {X_{rand} (t) - 2r_{2} X(t)} \right|} \hfill & {K \ge 0.5} \hfill \\ {(X_{prey} (t) - X_{m} (t)) - r_{3} (LB + r_{4} (UB - LB))} \hfill & {K < 0.5} \hfill \\ \end{array} } \right.$$
(10)

where \(X(t)\) and \(X(t + 1)\) are the hawk's positions at the current iteration t and the next iteration t + 1, respectively, \(X_{prey} (t)\) is the position of the prey, and \(X_{rand} (t)\) is a hawk selected at random from the current population. \(r_{1}\), \(r_{2}\), \(r_{3}\), \(r_{4}\) and K are random numbers in [0, 1]. LB and UB are the lower and upper bounds of the control variables. \(X_{m} (t)\) is the average position of the hawks, calculated as:

$$X_{m} (t) = \frac{1}{N}\sum\limits_{i = 1}^{N} {X_{i} (t)}$$
(11)

where N is the total number of hawks and \(X_{i} (t)\) is the position of the ith hawk at iteration t.
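The exploration update of Eqs. (10) and (11) can be sketched as follows. This is an illustrative sketch only: the function name `explore_step` is hypothetical, and clipping the new position to [LB, UB] is an assumption that the text does not state.

```python
import numpy as np

def explore_step(X, i, x_prey, lb, ub, rng):
    """One exploration update of hawk i.

    X: (N, D) population, x_prey: (D,) prey position, lb/ub: bounds (scalar or (D,)).
    """
    r1, r2, r3, r4, K = rng.random(5)           # random numbers in [0, 1)
    if K >= 0.5:
        # perch relative to a randomly chosen hawk, first case of Eq. (10)
        x_rand = X[rng.integers(X.shape[0])]
        new = x_rand - r1 * np.abs(x_rand - 2.0 * r2 * X[i])
    else:
        # perch relative to the prey and the mean position, second case of Eq. (10)
        x_m = X.mean(axis=0)                     # Eq. (11)
        new = (x_prey - x_m) - r3 * (lb + r4 * (ub - lb))
    return np.clip(new, lb, ub)                  # assumption: keep the hawk inside the bounds
```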

2. Transition from Exploration to Exploitation

The change from the exploration to exploitation in the HHO can be modeled based on the prey escaping energy as:

$$E = 2E_{0} \left( {1 - \frac{t}{T}} \right)$$
(12)

where E is the escaping energy of the prey, \(E_{0}\) is the initial energy of the prey, generated randomly in [−1, 1], and T is the maximum number of iterations. If \(\left| E \right| < 1\), the prey is weak and the hawks exploit the neighborhood of the prey location; if \(\left| E \right| \ge 1\), the prey has enough energy to escape, so the hawks continue exploring to locate it.
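A short sketch of Eq. (12) makes one consequence explicit: since \(\left| {E_{0} } \right| \le 1\), the bound \(\left| E \right| \le 2(1 - t/T)\) holds, so once t exceeds T/2 the magnitude of E stays below 1 and only exploitation moves remain. The function name `escaping_energy` and the use of Python's `random.Random` are assumptions made for the example.

```python
import random

def escaping_energy(t, T, rng):
    """Eq. (12): E = 2 * E0 * (1 - t / T), with E0 redrawn for every call."""
    e0 = rng.uniform(-1.0, 1.0)   # initial energy of the prey
    return 2.0 * e0 * (1.0 - t / T)

rng = random.Random(0)
T = 100
# In the second half of the run |E| is always below 1, so the hawks only exploit.
late_energies = [abs(escaping_energy(t, T, rng)) for t in range(T // 2 + 1, T + 1)]
```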

3. Exploitation Phase

The exploitation process of HHO is modeled on the escaping energy of the prey E and its chance of escaping r. There are four besiege strategies in the exploitation phase of HHO.

a: Soft Besiege (\(r \ge 0.5\) and \(\left| E \right| \ge 0.5\)).

In soft besiege phase, the prey tries to escape using random jumps, but the hawks surround it softly. This process can be expressed as:

$$X(t + 1) = \Delta X(t) - E\left| {KX_{prey} (t) - X(t)} \right|$$
(13)
$$\Delta X(t) = X_{prey} (t) - X(t)$$
(14)

where \(\Delta X(t)\) is the difference between the prey position and the hawk's position at iteration t, and \(K = 2(1 - r_{5} )\) is the strength of the prey's random jumps during the escape (not to be confused with the random number K in Eq. (10)). \(r_{5}\) is a random number in [0, 1].

b: Hard Besiege (\(r \ge 0.5\) and \(\left| E \right| < 0.5\)).

In hard besiege mode, the prey becomes too exhausted to escape. As a result, the hawks effortlessly catch the prey, and then pounce on it. This move can be modeled as:

$$X(t + 1) = X_{prey} (t) - E\left| {\Delta X(t)} \right|$$
(15)

c: Soft Besiege with Progressive Rapid Dives (\(r < 0.5\) and \(\left| E \right| \ge 0.5\)).

In soft besiege with progressive rapid dives, the prey is still able to escape and the hawks surround it softly. To model this besiege, the Lévy flight (LF) concept is used as follows:

$$H = X_{prey} (t) - E\left| {KX_{prey} (t) - X(t)} \right|$$
(16)

where H is the soft besiege position. The hawks then dive based on the LF as:

$$G = H + S \times LF(D)$$
(17)

where D is the dimension of the problem and S is a random vector of size 1 × D. The LF is calculated as:

$$LF(x) = 0.01 \times \frac{u \times \sigma }{{\left| v \right|^{{\frac{1}{\beta }}} }}$$
(18)
$$\sigma = \left( {\frac{{\Gamma (1 + \beta ) \times \sin \left( {\frac{\pi \beta }{2}} \right)}}{{\Gamma \left( {\frac{1 + \beta }{2}} \right) \times \beta \times 2^{{\left( {\frac{\beta - 1}{2}} \right)}} }}} \right)^{{\frac{1}{\beta }}}$$
(19)

where \(\beta\) is a constant set to 1.5, and u and v are random values in [0, 1].
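Equations (18) and (19) can be evaluated directly. The sketch below follows the text in drawing u and v uniformly; the function names `levy_sigma` and `levy_flight` are hypothetical, and shifting v into (0, 1] to avoid division by zero is an assumption of the example. Implementations based on Mantegna's algorithm typically draw u and v from normal distributions instead, which is not what the text above specifies.

```python
import math
import numpy as np

def levy_sigma(beta=1.5):
    """Eq. (19): scale factor sigma for a given beta."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    return (num / den) ** (1.0 / beta)

def levy_flight(dim, rng, beta=1.5):
    """Eq. (18): one LF step per dimension."""
    u = rng.random(dim)             # uniform in [0, 1)
    v = 1.0 - rng.random(dim)       # uniform in (0, 1], never zero (assumption)
    return 0.01 * u * levy_sigma(beta) / np.abs(v) ** (1.0 / beta)
```

For β = 1.5, `levy_sigma()` evaluates to approximately 0.6966.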

Finally, the updating position of hawks is calculated as:

$$X(t + 1) = \left\{ {\begin{array}{*{20}c} H & {if\;F(H) < F(X(t))} \\ G & {if\;F(G) < F(X(t))} \\ \end{array} } \right.$$
(20)

d: Hard Besiege with Progressive Rapid Dives (\(r < 0.5\) and \(\left| E \right| < 0.5\)).

In hard besieges with progressive rapid dives case, the behavior of the prey is similar to its behavior in the case of a soft besiege. The position of team members is updated by reducing the distance between their average position and the prey position. This motion is described as:

$$X(t + 1) = \left\{ {\begin{array}{*{20}l} {H^{^{\prime}} } \hfill & {if\;F(H^{^{\prime}} ) < F(X(t))} \hfill \\ {G^{^{\prime}} } \hfill & {if\;F(G^{^{\prime}} ) < F(X(t))} \hfill \\ \end{array} } \right.$$
(21)
$$H^{^{\prime}} = X_{prey} (t) - E\left| {KX_{prey} (t) - X_{m} (t)} \right|$$
(22)
$$G^{^{\prime}} = H^{^{\prime}} + S \times LF(D)$$
(23)

where Xm(t) is obtained from Eq. (11).
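The four besiege strategies in Eqs. (13) to (23) differ only in the branch taken for r and \(\left| E \right|\), which a short sketch makes visible. This is an illustration under several assumptions, not the implementation used in this study: the names `exploit_step` and `levy_flight` are hypothetical, S is drawn uniformly, H is tried before G, and the hawk keeps its current position when neither H nor G improves the fitness (Eqs. (20) and (21) leave that case open).

```python
import math
import numpy as np

def levy_flight(dim, rng, beta=1.5):
    """Eqs. (18)-(19) with uniform u and v; v is kept away from zero (assumption)."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    sigma = (num / den) ** (1.0 / beta)
    u, v = rng.random(dim), 1.0 - rng.random(dim)
    return 0.01 * u * sigma / np.abs(v) ** (1.0 / beta)

def exploit_step(x, x_prey, x_mean, E, fitness, rng):
    """One exploitation update of a hawk at position x (fitness is minimized)."""
    r = rng.random()                      # chance of escaping
    K = 2.0 * (1.0 - rng.random())        # jump strength K = 2(1 - r5)
    if r >= 0.5 and abs(E) >= 0.5:        # a: soft besiege, Eqs. (13)-(14)
        return (x_prey - x) - E * np.abs(K * x_prey - x)
    if r >= 0.5:                          # b: hard besiege, Eq. (15)
        return x_prey - E * np.abs(x_prey - x)
    # c and d: progressive rapid dives; c uses X(t), d uses the mean position X_m(t)
    ref = x if abs(E) >= 0.5 else x_mean
    H = x_prey - E * np.abs(K * x_prey - ref)              # Eq. (16) or Eq. (22)
    if fitness(H) < fitness(x):
        return H
    G = H + rng.random(x.size) * levy_flight(x.size, rng)  # Eq. (17) or Eq. (23)
    return G if fitness(G) < fitness(x) else x
```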

C. The Proposed HHO-KELM Method

The diagnosis performance of KELM is affected by the penalty factor C and kernel parameter σ. In order to obtain the optimal parameters of the KELM, we optimize the penalty factor C and the parameter σ of the kernel function by HHO. The specific steps of HHO optimizing KELM parameters are described as follows:

Step 1: Initialize the KELM parameters C and σ and the parameters of HHO, including the number of search agents N, the maximum number of iterations T and the dimension D of the search space.

Step 2: Calculate the objective function value of each hawk through Eq. (9) and update the position of Xprey. The fitness function of HHO in the optimization process is the mean square error (MSE):

$$MSE = \frac{1}{M}\sum\nolimits_{i = 1}^{M} {(y_{D} (i) - y(i))^{2} }$$
(24)

where yD and y represent the measured value and the observed value, respectively, and M is the number of observation samples.

Step 3: Update the initial energy, jump strength and escaping energy of each individual in each iteration.

Step 4: Determine whether the number of iterations is equal to the maximum value. If it is not, execute t = t + 1 and return to step 2. Otherwise, go to step 5.

Step 5: Export HHO optimal solutions (values of C and σ).
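The fitness evaluation in Step 2 can be sketched as follows: each hawk is a candidate pair (C, σ), a KELM is fitted with those values, and the MSE of Eq. (24) is returned. This is a minimal sketch under stated assumptions: the names `kelm_mse` and `select_prey` are hypothetical, the MSE is averaged over all output components, and it is computed on the training samples, which the steps above do not specify.

```python
import numpy as np

def kelm_mse(params, X, T):
    """Fitness of one hawk. params = (C, sigma); X: (N, n) inputs; T: (N, m) targets."""
    C, sigma = params
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    omega = np.exp(-sq / sigma)                               # RBF kernel matrix, Eqs. (7)-(8)
    alpha = np.linalg.solve(np.eye(len(X)) / C + omega, T)
    Y = omega @ alpha                                         # KELM outputs, Eq. (9)
    return float(np.mean((T - Y) ** 2))                       # Eq. (24)

def select_prey(hawks, X, T):
    """Evaluate every hawk and return the best one (the prey) with its fitness."""
    scores = np.array([kelm_mse(h, X, T) for h in hawks])
    best = int(np.argmin(scores))
    return hawks[best], float(scores[best])
```

In a full loop, `select_prey` would be called once per iteration, after the hawk positions have been updated and clipped to the search bounds for C and σ.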

The flow chart of this procedure is described in Fig. 1.

Fig. 1 The flowchart of HHO optimizing KELM parameters

3 Transformer Fault Diagnosis

3.1 Model Based on HHO-KELM

A. Transformer Fault Types and Input Characterizing Vector Selection

Faults inside transformers are classified into thermal faults and electrical faults according to [32]. Thermal faults are divided by temperature into low and medium thermal faults (LM-T, T < 700 °C) and high thermal faults (H-T, T > 700 °C). Electrical faults are divided into low-energy discharge (LE-D) and high-energy discharge (HE-D). In addition, this work takes the normal state (NS) into account to distinguish between normal and failure modes.

These incipient faults decompose the oil and insulation paper and generate several kinds of gases: hydrogen (H2), methane (CH4), acetylene (C2H2), ethylene (C2H4), ethane (C2H6), carbon monoxide (CO) and carbon dioxide (CO2). Based on these gases, Non-code ratios are taken as the input characterizing vector of the diagnostic model in this paper. The Non-code ratios are CH4/H2, C2H4/C2H2, C2H4/C2H6, C2H2/(C1 + C2), H2/(H2 + C1 + C2), C2H4/(C1 + C2), CH4/(C1 + C2), C2H6/(C1 + C2) and (CH4 + C2H4)/(C1 + C2), where C1 denotes CH4 and C2 denotes the sum of C2H4, C2H6 and C2H2.

B. Transformer Fault Diagnosis Model Based on HHO-KELM

In this study, a fault diagnosis model based on HHO-KELM for transformers is established. It can be divided into three parts: data collection and processing, the training of the network, and the performance testing and evaluation of the network model.

In the data collection and processing part, dissolved gas contents are first collected from transformer fault records as DGA data, and the collected DGA data are then transformed into Non-code ratios. A ratio of the form 0/0 is set to 0, and a ratio of the form c/0 with c ≠ 0 is set to 20 [33]. The obtained DGA ratios are then normalized to remove the influence of differing orders of magnitude:

$$x_{si} = \frac{{x_{i} - x_{i\min } }}{{x_{i\max } - x_{i\min } }}$$
(25)

where \(x_{i}\) and \(x_{si}\) are the DGA ratios before and after normalization, and \(x_{i\max }\) and \(x_{i\min }\) are the maximum and minimum values of that DGA ratio.
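The preprocessing described above (the nine Non-code ratios, the 0/0 and c/0 conventions, and the min-max normalization of Eq. (25)) can be sketched in plain Python. The function names and the gas concentrations in the usage example are hypothetical, and mapping a constant column to 0 during normalization is an assumption that Eq. (25) does not cover.

```python
def safe_ratio(num, den):
    """0/0 is set to 0 and c/0 (c != 0) is set to 20; otherwise the plain ratio."""
    if den == 0:
        return 0.0 if num == 0 else 20.0
    return num / den

def non_code_ratios(h2, ch4, c2h2, c2h4, c2h6):
    """Return the nine Non-code ratios in the order listed in the text."""
    c1 = ch4
    c2 = c2h4 + c2h6 + c2h2
    total = c1 + c2
    return [
        safe_ratio(ch4, h2),
        safe_ratio(c2h4, c2h2),
        safe_ratio(c2h4, c2h6),
        safe_ratio(c2h2, total),
        safe_ratio(h2, h2 + total),
        safe_ratio(c2h4, total),
        safe_ratio(ch4, total),
        safe_ratio(c2h6, total),
        safe_ratio(ch4 + c2h4, total),
    ]

def min_max_normalize(rows):
    """Eq. (25) applied column by column; constant columns map to 0 (assumption)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical sample with no acetylene and no ethane, to exercise the c/0 rule.
sample = non_code_ratios(h2=10, ch4=20, c2h2=0, c2h4=30, c2h6=0)
```

In practice the minimum and maximum would be taken from the training set and reused for the testing set, so that both are scaled identically.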

In the network training part, a training dataset is constructed by using DGA records with transformer fault types. The characterizing vectors are input to the KELM network. The KELM adopts the HHO algorithm to obtain the model parameters. The learnt feature shows the relationships between the DGA ratios and the corresponding fault patterns.

After training, the diagnostic parameters are obtained. The trained model then classifies the fault types of the samples in the testing set or of the transformer under investigation. Accuracy is used to assess the fault diagnosis approach and is computed as:

$$A = \frac{{N_{correct} }}{{N_{total} }}$$
(26)

where Ncorrect is the number of correctly diagnosed samples and Ntotal is the total number of samples in the dataset being evaluated.
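Equation (26) can be computed from the model outputs as sketched here. The function name `accuracy` is hypothetical, and decoding the predicted class as the index of the largest output component is an assumption about how the encoded class labels are read back.

```python
def accuracy(outputs, labels):
    """Eq. (26): fraction of samples whose decoded class matches the true label.

    outputs: list of per-class score lists, labels: list of true class indices.
    """
    predicted = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    correct = sum(p == t for p, t in zip(predicted, labels))
    return correct / len(labels)
```

For example, 22 correct diagnoses out of 25 samples give A = 22/25 = 0.88.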

The implementation process of the proposed fault diagnosis model is described as follows:

Step 1: Obtain the DGA data of transformers through the acquisition device.

Step 2: Transform the DGA data into Non-code ratios.

Step 3: Normalize the obtained DGA ratios and construct a feature vector.

Step 4: Divide the data into a training set and a testing set. The training set is used to train a diagnostic model.

Step 5: Use the testing set to test and evaluate the performance of the proposed model.

The flow chart of the proposed transformer fault diagnosis model is shown in Fig. 2.

Fig. 2 The flow chart of the proposed fault diagnosis model

4 Results and Analysis

A. Test Data

DGA samples are collected from the IEC TC 10 database [11], as shown in Table 1. These samples are used to evaluate the performance of the proposed fault diagnosis method.

B. Model Parameters

Table 1 Test data of IEC TC 10 database

The relevant parameters of HHO and KELM for transformer fault diagnosis are detailed in Table 2.

C. Transformer Fault Diagnosis Results

Table 2 Parameters of KELM and HHO

To investigate the accuracy of HHO-KELM for transformer fault diagnosis, 93 of the 118 cases are used to train the KELM structure parameters, and the remaining 25 samples (5 samples of each fault) are used to test the diagnosis capability of the trained KELM.
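A split of this kind, holding out a fixed number of samples per class, can be sketched as follows. The function name `holdout_per_class` is hypothetical, random selection of the held-out samples is an assumption (the text does not say how the 25 test samples were chosen), and the per-class counts in the usage example are invented so that they add up to 118.

```python
import random

def holdout_per_class(labels, k, seed=0):
    """Hold out k samples of every class for testing; the rest are used for training."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    test_idx = []
    for label in sorted(by_class):
        test_idx.extend(rng.sample(by_class[label], k))   # k distinct indices per class
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)

# Hypothetical class sizes (not taken from Table 1) that sum to 118 samples.
labels = ["LM-T"] * 30 + ["H-T"] * 20 + ["LE-D"] * 25 + ["HE-D"] * 28 + ["NS"] * 15
train_idx, test_idx = holdout_per_class(labels, k=5)
```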

Figure 3 reports the typical HHO fitness variation curves. The convergence time is 11.2693 s. The fitness value (mean square error) reaches 0.1708 after the second iteration and then remains flat for a while; after the fifth iteration it drops to 0.1630, the best value obtained. This shows that the HHO algorithm converges to the best solution very quickly.

Fig. 3 The fitness variation curves of HHO

The diagnosis results of HHO-KELM are presented in Fig. 4. HHO-KELM makes four errors on the training samples, for a classification accuracy of 95.7% (89/93). On the test samples it makes three errors, for a classification accuracy of 88% (22/25). This indicates that the proposed method has good accuracy.

D. Comparisons with Different Input Characterizing Vector

Fig. 4 The diagnosis results using the HHO-KELM model. a The diagnosis results on training samples. b The diagnosis results on test samples

To further evaluate the proposed method, four kinds of input characterizing vectors are considered for HHO-KELM: (1) all the DGA data (H2, CH4, C2H2, C2H4, C2H6, CO, CO2); (2) IEC ratios (CH4/H2, C2H2/C2H4 and C2H4/C2H6); (3) Rogers ratios (CH4/H2, C2H2/C2H4, C2H4/C2H6 and C2H6/CH4); (4) Non-code ratios. Each of the four is used as the KELM input, and the fault diagnosis results are reported in Table 3 and Fig. 5.

Table 3 HHO-KELM diagnosis results using different characterizing vectors

Fig. 5 HHO-KELM diagnosis results using different characterizing vectors

As seen in Table 3 and Fig. 5, HHO-KELM with Non-code ratios detects LM-T better than the others, whereas HHO-KELM with IEC ratios or Rogers ratios cannot detect LM-T properly. HHO-KELM with Non-code ratios also has the best performance for H-T. For LE-D, the accuracy of HHO-KELM with Non-code ratios is the same as that of HHO-KELM with IEC ratios and Rogers ratios. All methods except HHO-KELM with all the DGA data can detect HE-D and N-S properly.

In short, HHO-KELM with Non-code ratios shows the highest total accuracy. Compared with HHO-KELM using all the DGA data, IEC ratios and Rogers ratios, its average accuracy on the testing samples is higher by 60, 32 and 24 percentage points, respectively.

E. Comparisons with Different Diagnostic Methods

To verify the diagnosis performance, ELM, KELM, KELM optimized by a genetic algorithm (GA-KELM) and KELM optimized by particle swarm optimization (PSO-KELM), all with Non-code ratios as the input vector, are compared with HHO-KELM.

Figure 6 shows the relationship between the fitness value (mean square error) and the number of iterations for the different algorithms used to optimize KELM. It can be seen from Fig. 6 that the HHO-KELM model needs fewer iterations and finds a stable solution close to the best goal. The original ELM and KELM do not appear in this plot because they do not use an iterative optimization algorithm and therefore have no fitness curve.

Fig. 6 The relationship between the fitness value and the number of iterations for different algorithms

The fault diagnosis results for five different methods are shown in Table 4 and Fig. 7.

Table 4 Diagnosis results using different methods

Fig. 7 Diagnosis results using different methods

According to Table 4 and Fig. 7, HHO-KELM detects LM-T better than the other methods, while ELM and KELM are the worst at detecting LM-T. ELM, PSO-KELM, GA-KELM and HHO-KELM can effectively diagnose H-T faults. However, HHO-KELM is weak in detecting LE-D, where its accuracy is lower than that of PSO-KELM. All methods perform well for HE-D. For N-S, GA-KELM and HHO-KELM are more accurate than ELM, KELM and PSO-KELM.

It can be concluded that HHO-KELM has the best average accuracy. Compared with ELM, KELM, GA-KELM and PSO-KELM, the average fault diagnosis accuracy of HHO-KELM on the testing set is higher by 28, 24, 8 and 4 percentage points, respectively, which shows that HHO-KELM classifies more accurately than the compared methods.

F. Fault Diagnosis Using China Transformers DGA Data

To investigate the fault diagnosis capability of the proposed method more comprehensively, 180 DGA samples from transformers in China are diagnosed by HHO-KELM. The 180 samples are divided into two parts: 130 samples are used to train the HHO-KELM model and the other 50 samples (10 samples of each fault) are used to test the trained model. Figure 8 shows an accuracy of 90% (46/50) on the China DGA data, which indicates that the proposed method also performs well on this dataset.

Fig. 8 Fault diagnosis results of DGA samples in China

5 Conclusion

In this work, a fault diagnosis method based on HHO-KELM is presented to diagnose the different fault states of transformers. The proposed model uses Non-code ratios as the input vector, and the HHO algorithm is used to optimize the KELM model parameters. Numerical analyses are presented to demonstrate the validity of the proposed strategy. The comparison results show that the average test accuracy with the Non-code ratios feature set is 60, 32 and 24 percentage points higher than with the full DGA data, the IEC ratios and the Rogers ratios, respectively. The accuracy of the HHO-KELM model for transformer fault diagnosis reaches 88%, which is higher than that of the ELM, KELM, PSO-KELM and GA-KELM models. Given these results, the proposed method outperforms all compared diagnosis models in average accuracy and is an accurate model for transformer fault diagnosis.

However, some shortcomings remain in this study, such as the limited number of data samples. Additional datasets could be considered in future work to further validate the fault diagnosis accuracy of the proposed method.