1 Introduction

The power transformer is one of the major apparatuses in transmission and distribution systems. Failure of a power transformer causes huge economic losses to industry and inconvenience to the general public. Two types of insulation are used in power transformers: liquid insulation (mineral oil or transformer oil) and solid impregnated insulation (cellulose). Of the two, the liquid insulation is especially important. The transformer oil provides electrical insulation, dissipates heat, helps to preserve the core and windings, and prevents direct contact of atmospheric oxygen with the cellulose paper insulation of the windings [1–6].

Thermal and electrical stresses decompose the transformer oil and produce harmful gases such as hydrogen (H2), methane (CH4), acetylene (C2H2), ethylene (C2H4) and ethane (C2H6); as a result, the oil degrades and its performance declines. This can be prevented by knowing the exact amounts of harmful gases dissolved in the transformer oil. Conventional methods such as Roger's ratio method, Dornenburg's method, Duval's triangle method and the key gas ratio methods are used to diagnose faults from the gases dissolved in the transformer oil. These methods are either inconclusive about the fault or report a false fault type [7–11]. To overcome these uncertainties of the conventional methods, various intelligent methods such as artificial neural networks [12, 13], wavelet analysis [14], Learning Vector Quantization [15], the Probabilistic Neural Network (PNN) [16], fuzzy logic [17–19], Support Vector Machine classifiers [20–23] and Self-Organizing Map classifiers [24, 25] have been proposed.

This article deals with fault classification in power transformers using the Backpropagation Neural Network (BPN) and the PNN. The merits of the BPN classifier include fair approximation of a large class of functions, relatively simple implementation, and the fact that the mathematical formulation of the BPN algorithm can be applied to any network. The PNN classifier, on the other hand, can generate accurate predicted target probability scores; it is insensitive to outliers and converges faster than the BPN classifier.

The performance of the BPN and PNN classifiers has been compared for transformer fault classification, and the PNN classifier gives better results than the BPN classifier. The comparison has also been carried out among different BPN learning algorithms for classifying transformer faults.

2 Proposed fault classification scheme

Figure 1 shows the flow chart of the proposed transformer fault classification scheme. The raw data are collected and preprocessed; dimension reduction and feature selection are then performed; and in the final classification step, neural network-based classifiers are applied to determine the different faults.

Fig. 1 Block diagram of the proposed transformer fault classification scheme

3 Methodology

3.1 Dissolved gas analysis (DGA)

Dissolved gas analysis (DGA) provides advance warning of developing faults. Some of the methods used in industry for DGA are IEEE Std C57.104-1991, IEC 60599:1999, Duval's triangle, the CIGRE method and the Nomograph method. IEEE Std C57.104-1991 and IEC 60599:1999 are key gas ratio methods [26, 27]. The ratio methods do not cover the entire range of data, so the fault classification sometimes yields no result. The CIGRE method is a combination of the key gas ratio method and the gas concentration method [28, 29]. Duval's triangle method and the Nomograph method are graphical methods. All of these methods share a limitation: when multiple DGA faults occur in the system, none of them is able to detect them. Table 1 shows the allowable range of harmful gases (in ppm) in transformer oil for OLTC and commutating OLTC [30]. Table 2 shows the different power transformer faults defined by the combined IEC/IEEE and CIGRE criteria [26].
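For illustration, the three gas ratios on which the IEC 60599 ratio scheme is based (C2H2/C2H4, CH4/H2 and C2H4/C2H6) can be computed directly from the measured concentrations. The following is a minimal sketch of that step; the function name and the sample values are illustrative, not taken from the paper:

```python
def iec_ratios(h2, ch4, c2h2, c2h4, c2h6):
    """Return the three IEC 60599 key gas ratios from concentrations in ppm.

    Ratios are (C2H2/C2H4, CH4/H2, C2H4/C2H6); a real implementation would
    also guard against zero concentrations in the denominators.
    """
    return c2h2 / c2h4, ch4 / h2, c2h4 / c2h6

# Hypothetical gas sample (ppm), for illustration only.
r1, r2, r3 = iec_ratios(h2=100.0, ch4=120.0, c2h2=1.0, c2h4=50.0, c2h6=65.0)
print(f"C2H2/C2H4 = {r1:.2f}, CH4/H2 = {r2:.2f}, C2H4/C2H6 = {r3:.2f}")
```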

Table 1 Allowable range of harmful gases (in ppm) in transformer oil for OLTC and commutating OLTC as per IEC 60599
Table 2 Combined criterion of IEC/IEEE and CIGRE standards for integrating fault types

3.2 Data collection

Gas samples are collected from transformers at ten substations of the Punjab State Electricity Board, Patiala (India), as per the ASTM standards. The transformer ratings range from 52 to 63 MVA, at a voltage rating of 132/33/11 kV. After collection, the data are preprocessed by removing linear trends, outliers, etc. Table 3 shows the preprocessed data of the samples obtained from the Punjab State Electricity Board. The raw gas data collected from the different transformers are statistically analyzed; the variance of each gas sample is represented by the ANOVA plot in Fig. 2. For normalization, the mean value is subtracted from each point, so the data are zero-mean.
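A rough sketch of this preprocessing stage is given below, assuming standard NumPy/SciPy tooling; the 3-sigma clipping rule is our assumption, since the paper does not state which outlier criterion was used:

```python
import numpy as np
from scipy.signal import detrend

def preprocess(raw):
    """Detrend, clip outliers and zero-mean each gas column.

    raw: (n_samples, n_gases) array of dissolved-gas concentrations in ppm.
    """
    x = detrend(raw, axis=0)                   # remove linear trends per gas
    mu, sd = x.mean(axis=0), x.std(axis=0)
    x = np.clip(x, mu - 3 * sd, mu + 3 * sd)   # clip outliers (assumed 3-sigma rule)
    return x - x.mean(axis=0)                  # zero-mean data, as described above
```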

Table 3 Preprocessed samples of data of dissolved gases in power transformers of Punjab state electricity board
Fig. 2 The variance plot of the raw data collected from different transformers

3.3 Feature selection

The process of mapping the original features of the data into fewer, more effective features is called feature extraction. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are well-known feature extraction methods [31, 32]; LDA is a supervised technique, whereas PCA is unsupervised. PCA, also known as the Karhunen–Loeve transform, is one of the most popular statistical techniques; it reduces the dimensionality of a dataset while preserving the correlation structure in the data, and is widely used for feature extraction. The steps of PCA are as follows (a code sketch of these steps is given after the discussion below):

(i) Get the input data.

(ii) Calculate the mean of the data and subtract it from each point.

(iii) Calculate the covariance

$$\text{cov} \left( {x,y} \right) = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {X_{i} - \bar{X}} \right)\left( {Y_{i} - \bar{Y}} \right)} }}{n - 1}$$

where \(X_i\) (the input) is the DGA dataset, i = 1–600, \(\bar{X}\) is the mean value of the dataset, and y (the output) is the fault type of the transformer.

(iv) Calculate the eigenvectors and eigenvalues of the covariance matrix.

(v) Choose components and form a feature vector.

(vi) Derive the new dataset from the following formula:

$${\text{Final data}} = {\text{Row Feature Vector}} \times {\text{Row Data Adjust}}$$

where the Row Feature Vector is the matrix with the eigenvectors in its rows (the eigenvector matrix transposed, with the most significant eigenvector at the top), and the Row Data Adjust is the mean-adjusted data, also transposed. The assumption made for feature extraction and dimensionality reduction by PCA is that most of the information in the observation vectors is contained in the subspace spanned by the first m principal axes, where m < p for a p-dimensional data space. Each original data vector can therefore be represented by its principal component vector of dimensionality m.
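A minimal NumPy sketch of steps (i)–(vi) might look as follows; it illustrates the procedure and is not the authors' code (the projection is computed in row-vector orientation, i.e., the transpose of the Final data formula above):

```python
import numpy as np

def pca_reduce(data, m):
    """Reduce (n_samples, p) data to its first m principal components."""
    centered = data - data.mean(axis=0)        # steps (i)-(ii): subtract the mean
    cov = np.cov(centered, rowvar=False)       # step (iii): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step (iv): eigen-decomposition
    order = np.argsort(eigvals)[::-1]          # sort axes by decreasing variance
    feature_vector = eigvecs[:, order[:m]]     # step (v): keep top-m eigenvectors
    return centered @ feature_vector           # step (vi): project the data
```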

4 Results and discussion

The Backpropagation Neural Network and the Probabilistic Neural Network have been used as fault classifiers to classify the different transformer fault types according to the international standard IEC 60599. The following paragraphs describe the two fault classifiers briefly.

An artificial neural network maps input samples to outputs in a nonlinear fashion. Backpropagation is one of the oldest learning algorithms and is used to train a multilayer feedforward neural network. Apart from the input and output layers, there is one hidden layer. The network has been trained with different numbers of neurons in the hidden layer, and it was found by trial and error that eighteen hidden neurons gave the best result for this problem. The networks are trained until the mean square error of the training samples falls below 0.005. In this paper, four backpropagation learning algorithms, namely Gradient Descent, Levenberg–Marquardt, Conjugate Gradient and Resilient Backpropagation, have been compared for transformer fault classification. The Levenberg–Marquardt algorithm was designed to approach second-order training speed without computing the Hessian matrix, and it has been shown to provide superior performance over the conventional Gradient Descent algorithm. In the Conjugate Gradient algorithm, the step size is adjusted in every iteration. In the Resilient Backpropagation algorithm, only the sign of the derivative is used to update the weights; the magnitude of the derivative has no effect on the weight update [33].

Five key gas ratios are taken as inputs to the neural network, and six output codes as its outputs. An ANN of size 5 × 18 × 6 is designed, and the backpropagation algorithm is used to train it. The dataset is divided into a training set (50 %) and a testing set (50 %). The network parameters for the backpropagation algorithm are as follows.

Gradient = 9.29 × 10⁻⁶, µ = 1 × 10⁻⁶, learning rate = 0.02, momentum factor = 0.8, number of neurons in the hidden layer = 18, and tolerance = 0.005.
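For concreteness, a sketch of such a 5 × 18 × 6 network with the stated learning rate, momentum and tolerance is given below. It uses scikit-learn's MLPClassifier on synthetic placeholder data; note that scikit-learn offers no Levenberg–Marquardt trainer, so only the gradient descent variant is illustrated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder standing in for the 1200 x 5 gas-ratio dataset (6 fault codes).
rng = np.random.default_rng(0)
X = rng.random((1200, 5))
y = rng.integers(0, 6, 1200)

# 50 % training / 50 % testing split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# 5 x 18 x 6 network trained by gradient descent with momentum.
bpn = MLPClassifier(hidden_layer_sizes=(18,), solver="sgd",
                    learning_rate_init=0.02, momentum=0.8,
                    tol=0.005, max_iter=2000)
bpn.fit(X_train, y_train)
print("test accuracy:", bpn.score(X_test, y_test))
```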

Figure 3 shows the error versus epoch graphs for the different backpropagation learning algorithms, i.e., plots of the error (the difference between the desired output and the actual output) against the number of iterations taken to reach the minimum error. The figure makes clear that the Levenberg–Marquardt algorithm takes the fewest iterations to converge to the required tolerance level. Figure 4 shows the regression (R) plots of the different learning algorithms, drawn to measure the correlation between outputs and targets; an R value of one indicates a close relationship, and zero a random one. The regression plots show that R is closest to one when the dataset is trained with Levenberg–Marquardt, indicating that the outputs of the trained network are quite close to the targets.

Fig. 3 Error versus epoch graphs: a Gradient Descent method (best training performance 0.0099 at epoch 66); b Levenberg–Marquardt method (best training performance 0.099 at epoch 5); c Conjugate Gradient Descent method (best training performance 0.0087 at epoch 16); d Resilient method (best training performance 0.0097 at epoch 35)

Fig. 4 Regression plots: a Gradient Descent method (R = 0.97, Output = 0.94 × Target + 0.0022); b Levenberg–Marquardt method (R = 0.99, Output = 0.95 × Target + 0.032); c Conjugate Gradient Descent method (R = 0.96, Output = 0.89 × Target + 0.020); d Resilient method (R = 0.95, Output = 0.86 × Target + 0.004)

The backpropagation training algorithm has some drawbacks. It is too slow for practical applications, especially when many hidden layers are employed, and an appropriate selection of its training parameters is difficult, being based purely on trial and error. Many learning algorithms and modifications of backpropagation exist in the literature, but none of them completely solves these problems [34]. To overcome these drawbacks, a different neural network, the PNN classifier, is used in this article. Specht introduced the PNN in 1990 as a three-layer feedforward neural network architecture consisting of an input layer, a pattern layer and a summation layer. The PNN is the neural network implementation of Parzen window kernel discriminant analysis and is built on a probabilistic model. Unlike the backpropagation algorithm, it is guaranteed to converge; no learning process is required, and no weights need to be set [35, 36].

The Parzen window estimate of the probability density from m training samples is given by

$$f\left( x \right) = \frac{1}{{\left( {2\pi } \right)^{d/2} \sigma^{d} m}}\sum\limits_{i = 1}^{m} {\exp \left[ { - \frac{{\left( {x - x_{i} } \right)^{T} \left( {x - x_{i} } \right)}}{{2\sigma^{2} }}} \right]}$$
(1)

The output of the pattern layer is calculated as

$$\varphi_{ki} \left( x \right) = \frac{1}{{\left( {2\pi } \right)^{d/2} \sigma^{d} }}\exp \left[ { - \frac{{\left( {x - x_{ki} } \right)^{T} \left( {x - x_{ki} } \right)}}{{2\sigma^{2} }}} \right]$$
(2)

where \(x_{ki}\) is the ith neuron (training) vector of the kth class, \(\sigma\) is the smoothing parameter, d is the dimension of the pattern vector x, \(\varphi_{ki}\) is the output of the pattern layer and the superscript T denotes the transpose.

The output of the summation layer for the kth neuron is

$$p_{k} \left( x \right) = \frac{1}{{\left( {2\pi } \right)^{d/2} \sigma^{d} N_{k} }}\sum\limits_{i = 1}^{{N_{k} }} {\exp \left[ { - \frac{{\left( {x - x_{ki} } \right)^{T} \left( {x - x_{ki} } \right)}}{{2\sigma^{2} }}} \right]}$$
(3)

Here, \(N_{k}\) is the total number of training samples in the kth class.

The output of the decision layer is

$$c\left( x \right) = \arg \max_{k} \left\{ {p_{k} \left( x \right)} \right\},\quad k = 1,2,3, \ldots ,m$$
(4)

where m denotes the number of classes in the training samples and c(x) is the estimated class of the pattern x.
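Equations (1)–(4) translate almost directly into code. The following NumPy sketch classifies one pattern with a PNN; the value of the smoothing parameter σ is an arbitrary assumption and would be tuned in practice:

```python
import numpy as np

def pnn_classify(x, train_x, train_y, sigma=0.1):
    """Classify pattern x following Eqs. (2)-(4).

    train_x: (N, d) training patterns; train_y: (N,) class labels.
    """
    d = train_x.shape[1]
    norm = (2 * np.pi) ** (d / 2) * sigma ** d        # common normalizing factor
    scores = {}
    for k in np.unique(train_y):
        xk = train_x[train_y == k]                    # pattern units of class k
        sq = np.sum((xk - x) ** 2, axis=1)            # (x - x_ki)^T (x - x_ki)
        # Eq. (3): summation layer averages the pattern-layer kernels of class k
        scores[k] = np.exp(-sq / (2 * sigma ** 2)).sum() / (norm * len(xk))
    return max(scores, key=scores.get)                # Eq. (4): decision layer
```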

The accuracy of the different BPN learning algorithms has been checked using confusion matrices. Each column of a confusion matrix represents the instances of a predicted class, while each row represents the instances of an actual class. The name stems from the fact that the matrix makes it easy to see whether the system is confusing two classes (i.e., commonly mislabelling one as another).
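Continuing the earlier backpropagation sketch, such a confusion matrix can be produced with scikit-learn (assuming the bpn model and the test split defined above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = bpn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # rows = actual class, columns = predicted
print(cm)
print("accuracy: %.3f" % accuracy_score(y_test, y_pred))
```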

Table 4 shows the confusion matrices for the Backpropagation Neural Network with its various learning algorithms. The accuracies in classifying the different transformer fault types with the Gradient Descent, Levenberg–Marquardt, Scaled Conjugate Gradient and Resilient algorithms are 88.3, 93.6, 93.5 and 93 %, respectively; the accuracy is highest when the LM method is used. Table 5 shows the confusion matrix for the PNN; its fault classification accuracy comes out to 95.6 %, an improvement over the backpropagation learning algorithms. To evaluate the performance of the classifiers on the six fault classes, the PNN is compared with the different backpropagation learning algorithms. The neural networks are trained on 600 samples of training data (100 samples of each class) and tested on a further 600 samples (100 of each class). Table 6 gives a comparative analysis of accuracy and regression among the different backpropagation learning algorithms and the PNN; it shows that the PNN is the better classifier. Table 7 gives the classification results and compares the actual faults with the simulated faults, from which it is again concluded that the PNN is the better classifier.

Table 4 Confusion Matrix of Backpropagation Neural Network showing fault classification results of different algorithms
Table 5 Confusion matrix of Probabilistic Neural Network showing fault classification results
Table 6 Comparison of regression and accuracy amongst different intelligent methods
Table 7 Comparison of the classification results with the actual faults

5 Conclusions

A comparative study of the backpropagation algorithm and the Probabilistic Neural Network has been carried out to classify transformer faults. The highest accuracies are 95.6 % for the Probabilistic Neural Network classifier and 93.6 % for the Backpropagation Network classifier (Levenberg–Marquardt method). The findings show that the Probabilistic Neural Network classifier outperforms the Backpropagation Network classifier. The proposed technique (the Probabilistic Neural Network classifier) is capable of identifying faults even when the dissolved gas ratio data lie outside the ranges specified by the conventional ratio methods. From a simulation point of view, early convergence, the absence of a learning process and no need to set weights are added advantages, which make the Probabilistic Neural Network a very useful fault classification tool for power transformers.