1 Introduction

Rapid progress of smart transmission and distribution systems with renewable energy integration equipped through power electronics switching devices, energy storage components, and digital control equipment exaggerates the issues related to power quality. It refers to any deviation/disturbance manifested in the electrical parameters such as voltage, current, and frequency from the standard rating. This problem may lead to failure or malfunctioning of equipment, increasing the risk of blackout, undesirable power loss, and the occurrence of instability in power system operation [1]. In addition to that PQ is becoming a relevant concern for service provider to supply high-quality power to consumers. Therefore, it is necessary to monitor and to mitigate for maintaining the PQ standard, particularly in complex and large power systems. In this regard, the development of new techniques and methodologies is inevitable on the detection and classification of PQ events. The detection of PQ events requires extracting the features from the disturbances to correctly characterize and analyze the signals efficiently. These sets of extracting features are used as the most important part of the generalized PQ event classification process [2,3,4]. With this motivation, the present study aims to formulate a new algorithm with the objective to classify accurately the PQ event signals.

Feature extraction is the primary step for the PQ event detection and classification. Generally, from the received non-stationary power signals, features are extracted to characterize the nature of the signals. This is done by following primarily two methods: first one through transform domain and the second one by extracting some statistical parameters of signal variation. The signal processing techniques are significantly able to obtain information even under noisy and disturbance conditions in time and frequency domain. Initially, the discrete Fourier transform (DFT), fast Fourier transform (FFT), and short-time Fourier Transform (STFT) are the most preferred techniques for time–frequency domain analysis because of their simplicity in implementation [5]. DFT is generally used for stationary PQ events, and it fails to sense the immediate variations in PQ events as the signals are non-stationary in nature. On the other hand, FFT fails to correctly compute amplitude frequencies and phases due to its effect of leakage, aliasing effects, and picket fence. Limited time and frequency resolution is the major limitation of the STFT approach, due to which it fails to be used extensively in PQ analysis. Although discrete wavelet transform (DWT) has a good time–frequency resolution, it experiences spectral leakage and picket fence [6]. Also, it is found that the DWT approach affected by the noise contamination. This in turn effects on detection requiring a higher level of decomposition and leads to the extra computational burden. Among other prominent signal processing approaches, the Gabor–Wigner transform (GT) is able to apply successfully for detection application due to its high signal-to-noise ratio and satisfactory time–frequency resolution [7]. Again for its limitation like complex computation depending on the sampling frequency, the use is restricted within the high-frequency applications. The Hilbert–Huang transform (HHT) is based on two distinct processes [8,9,10]. Firstly, the signal decomposition is performed through the empirical mode decomposition (EMD) technique into intrinsic mode functions (IMFs) to retrieve the valuable information associated with frequency and amplitude [11,12,13]. Subsequently, the Hilbert transform (HT) is useful to each IMF for computing the instantaneous amplitude and frequency with respect to time variation. The limitation like the limited scope for narrow-band applications needs to be handled for its fruitful function. The variable mode decomposition (VMD) is considered as another powerful signal processing tool for PQ analysis and detection due its decomposition capability of signals into various band-limited intrinsic mode functions (IMFs) [14, 15]. VMD can determine the related frequency bands and estimate the corresponding modes adaptively and continuously in a better way in comparison with EMD. However, the major limitations lie in the VMD application are boundary effects and sudden signal onset. Secondly, as the power signals are non-stationary in nature, VMD fails to work as the spectral bands of modes vary extremely with time [16]. Mathematical morphology (MM) is another nonlinear signal processing technique which is applied in many detection problems [17, 18]. The major advantage of the MM technique lies in its less computational burden and simple arithmetic and set theory calculation approach. However, the MM technique is yet to be exploited to justify its potential to detect accurately particularly with large data sets [19]. The S-transform attracts many researchers in recent times due to its ability to fully convertible from time domain to 2-D frequency translation domain and subsequently to the Fourier frequency domain. It is an extension to WT and STFT combination concept and based on a moving and scalable Gaussian window. The local spectral characteristics are represented completely by the amplitude–frequency–time spectrum and the phase–frequency–time spectrum. Apart from that, the real and imaginary components of the spectrum are possible to be localized by time through ST. The above factors play an important role and enhance performance in the detection and classification of PQ events [20,21,22]. Looking into the potential possibility of ST for detection through the relevant feature extraction after transformation of signal, it is considered in this study for further exploration for the PQ event detection.

Classification of several electrical disturbances by artificial intelligence (AI) classifier techniques is of interest to the electric power community in recent times. Even though neural network (NN)-based classifiers have high classification accuracy and provide mathematical flexibility, these methods fail to impress real-time application due to its demerits like: (1) less efficient under noisy conditions; (2) less convergence speed; and (3) dependency of accuracy on the system architecture and noise in the signal [23]. Support vector machine (SVM)-based classifiers arrive at very impressive results in many applications due to its advantages like: (1) having a high learning process; (2) capability to handle large features; and (3) providing a stable solution [6]. However, again like NN, learning is poor with minimal availability of features and in turn arrives at less accuracy. Among other AI techniques, fuzzy logic (FL)-based classifiers are applied to PQ events for its capability to model and analyze complex systems through the rule-based approach and degree of membership techniques [24]. Less adaptability on not considering new disturbances with existing topology is the major limitation. Apart from the above techniques, warping classifier, digital filtering and mathematical morphology, rule-based decision tree, and expert system-based approach play a vital role for PQ event classifiers [1,2,3,4]. Discussion on the review of all possible approach on PQ classifier gives an impression that till it needs further research to find a better classifier which adapting to new disturbance, lesser computational time, handling noisy signals, and higher accuracy. Looking into the above needful possibility, the present study focuses on extreme learning machine tuned with an evolutionary optimization approach-based classifier.

The improved WGO is formulated and used for determining the optimal ELM parameters, input weights, and hidden biases. The Moore–Penrose (MP) generalized inverse is applied to systematically calculate the output weights. The objective of modified WGO is devoted to reduce the norm of the output weights and to constrain the input weights and hidden biases within a reasonable range. This in turn enhances the classification accuracy and convergence in terms of computing time of ELM.

The major contribution of the study on power quality detection and classification is as follows:

  • Time–frequency analysis method based on S-transform (ST) is applied for feature extraction. Unlike other published articles based on ST, this approach considered as many as 13 types of features (not selectively) and ten PQ events generally occurred in real-time conditions.

  • Enhancing the searching strategy of WGO by a modified adaptive mutation-based approach. This novel step implements a variable mutation operator value which regulates the searching strategy during exploration and exploitation stage to result an improved performance.

  • Formulating a powerful evolutionary optimization algorithm for determining the optimal ELM parameters. This step enhances the performance of conventional ELM by optimally setting its parameters.

  • Justifying the proposed approach WGOELM by means of simulation considering almost all single and possible combined PQ disturbances. One of the major issues of real-time PQ monitoring system of simultaneous occurrence of multiple PQ events is emphasized in this study.

  • Establishing an approach to result with high accuracy and convergence rate for detection and classification of PQ events with noisy and disturbance conditions. In real-time monitoring condition, the accuracy of PQ event detection is always a major challenge under noisy signal cases. This problem is considered in the testing phase of the proposed approach.

The rest part of this manuscript presentation is structured into the following sections. In Sect. 2, the basic mathematical representation, feature extraction process, and detection capability of S-transform are presented. Theory and architecture of the WGO technique along with ELM for classification based on feature extracted in Sect. 2 are detailed in Sect. 3. To validate the performance of the proposed approach, the simulated results and comparisons with other existing methods are shown in Sect. 4. At last the findings from the obtained comparative results with future scope of this study are concluded in Sect. 5.

2 S-Transform

The ST can be considered as an expanded form of continuous WT (CWT). Frequency-dependent resolution of ST has a direct association with the Fourier spectrum. The signal information can be obtained from its phase and amplitude spectrum of a signal. To use the information present in the phase of CWT, it is essential for a phase correction in terms of modifying the phase of the mother wavelet. For a function \( h\left( t \right) \), its CWT can be presented as:

$$ W\left( {\tau ,d} \right) = \mathop \int \limits_{ - \infty }^{\infty } h\left( t \right)w\left( {d,t - \tau } \right){\text{d}}t $$
(1)

The width of the wavelet \( w\left( {d,t} \right) \) which is responsible for resolution control is denoted by the scale parameter \( d \). The ST of the function \( h\left( t \right) \) can be obtained from a CWT by multiplying a phase factor with a specific mother wavelet [20,21,22].

$$ S\left( {\tau ,f} \right) = {\text{e}}^{j2\pi ft } W\left( {d,\tau } \right) $$
(2)

The integrated mother wavelet used for ST can be expressed as follows:

$$ w\left( {t,f} \right) = \frac{\left| f \right|}{{\sqrt {2\pi } }}{\text{e}}^{{ - \frac{{t^{2} f^{2} }}{2}}} {\text{e}}^{ - j2\pi ft} $$
(3)

The frequency \( f \) is inversely proportional to \( d \). An admissible wavelet has to assure the zero mean condition. Under this constraint, (2) is not strictly a CWT function. This is due to the fact that the condition of an admissible wavelet is not satisfied by the wavelet expressed in (3). Considering this fact, the ST can be explicitly expressed as follows:

$$ S\left( {\tau ,f} \right) = \mathop \smallint \limits_{ - \infty }^{\infty } h\left( t \right)\frac{\left| f \right|}{{\sqrt {2\pi } }}{\text{e}}^{{ - \frac{{\left( {\tau - t} \right)^{2 } f^{2} }}{2}}} {\text{e}}^{ - j2\pi ft} {\text{d}}t $$
(4)

The ST can be formulated in terms of Fourier spectrum \( H\left( f \right)\;{\text{of}}\;h\left( t \right) \)

$$ S\left( {\tau ,f} \right) = \mathop \smallint \limits_{ - \infty }^{\infty } H\left( {\alpha + f} \right){\text{e}}^{{ - \frac{{2\pi^{2} \alpha^{2} }}{{f^{2} }}}} {\text{e}}^{j2\pi \alpha \tau } {\text{d}}\alpha ,\quad f \ne 0 $$
(5)

The discrete form of the power system disturbance signal can be denoted as \( h\left( {kT} \right),\quad k = 0,1, \ldots N - 1 \). Here, \( T {\text{and}} N \) denote the sampling time interval and the total sampling number, respectively. The \( h\left( {kT} \right) \) can be expressed as:

$$ H\left[ {\frac{n}{NT}} \right] = \frac{1}{N}\mathop \sum \limits_{k = 1}^{N - 1} h\left( {kT} \right) \cdot {\text{e}}^{{\left( { - \frac{j2\pi nk}{N}} \right)}} $$
(6)

where \( n = 0,1, \ldots ,N - 1 \)

With \( \left( {\tau \to kT\;{\text{and}}\;f \to n/NT} \right) \) and using (5), the ST of a discrete time series \( h\left( {kT} \right) \) can be expressed as:

$$ S\left[ {kT,\frac{n}{NT}} \right] = \mathop \sum \limits_{m = 0}^{N - 1} H\left[ {\frac{m + n}{NT}} \right]{\text{e}}^{{ - \frac{{2\pi^{2} m^{2} }}{{n^{2} }}}} {\text{e}}^{{\frac{j2\pi mk}{N}}} n \ne 0 $$
(7)

where \( k, m = 0,1, \ldots ,N - 1, \;{\text{and}}\;n = 1, \ldots ,N - 1. \) For n = 0,

$$ S\left[ {kT,0} \right] = \frac{1}{N}\mathop \sum \limits_{m = 0}^{N - 1} h\left( {\frac{m}{NT}} \right) $$
(8)

Initially, the discrete ST is computed through FFT and the convolution theorem. The major factor which attracts the ST for power quality analysis is the capability to localize both the phase spectrum and also the amplitude spectrum [22]. The ST amplitude matrix (STA) \( A\left( {kT,f} \right) = S\left[ {kT,n/NT} \right] \) is utilized to analyze the features characterizing the different power disturbances. In this process, higher time resolution is preserved at a higher frequency level with a lower time resolution at a low frequency. The rows and the columns of the matrix are the frequencies and time values with a dimension of N × M [22]. The rows of the matrix present the ST amplitude with all frequencies at the same instance of time. Similarly, columns of the matrix present the ST amplitude with time varying from \( 0\;{\text{to}}\;N - 1 \) in the same frequency. Here, \( n = 0,1, \ldots ,N/2 - 1 \). This is a multi-resolution approach with window width varying inversely with frequency and power data changed with respect to time. In this way, two major things are processed: One is a high time resolution at a high frequency, and the other is a high frequency resolution at a low frequency.

The following steps are followed for computing the discrete S-transform

  1. (1)

    Discrete Fourier transform is performed on \( h\left[ {kT} \right] \) with N points and sampling period T to obtain \( H\left[ {m/NT} \right] \)

  2. (2)

    The localized Gaussian \( G\left( {n,m} \right) \) is computed for the frequency \( n/NT. \)

  3. (3)

    The spectrum \( H\left[ {m/NT} \right] \) is shifted to \( H\left[ {\left( {m + n} \right)/NT} \right] \) for the frequency \( n/NT \).

  4. (4)

    The \( H\left[ {\left( {m + n} \right)/NT} \right] \) is multiplied by \( G\left( {n,m} \right) \) to obtain \( B\left[ {n/NT,m/NT} \right] \).

  5. (5)

    The inverse Fourier transform is performed on \( B\left[ {n/NT,m/NT} \right] \) to obtain the row of \( S\left[ {n/NT,jT} \right] \)

  6. (6)

    Steps 3, 4, and 5 are repeated till all the rows of \( S\left[ {n/NT,jT} \right] \) have been defined.

3 Extreme Learning Machine (ELM)

The ELM is a type of single-hidden layer feed-forward neural networks (SLFNs) [25]. It is a learning algorithm containing three-layer architecture as demonstrated in Fig. 1. The structure is designed with n, l, and m denoting number of input neurons, number of hidden neurons, and number of output neurons, respectively.

Fig. 1
figure 1

ELM architecture

Let us consider there are N training samples \( x_{i} ,t_{i} \), where \( x_{i} = \left[ {x_{i1} ,x_{i2} ,x_{i3} , \ldots ,x_{in} } \right]^{T} \in R^{n} \), \( t_{i} = \left[ {t_{i1} , t_{i2 } \ldots , t_{im} } \right]^{T} \in R^{m} , \)\( \left( {i = 1,2, \ldots ,N} \right) \). The output function of standard ELM for l hidden nodes is expressed as in (9).

$$ f\left( x \right) = \mathop \sum \limits_{i = 1}^{l} \beta_{i } h_{i} \left( x \right) = h\left( x \right)\beta $$
(9)

The hidden layer h output can be expressed in terms of functional representation as in Eq. (10).

$$ h = G\left( {ax + b} \right) $$
(10)

Similarly, the output neuron result can be presented as follows.

$$ h\left( {x_{i} } \right)\beta = t_{i}^{T} ,\quad i = 1,2, \ldots ,N $$
(11)

where \( G\left( x \right) \) denotes the hidden layer activation function. The ‘a’ indicates the weight matrix corresponding to the connections between input neuron and hidden neuron. The ‘\( \beta \)’ indicates the weight matrix corresponding to the connection between hidden neuron and output neuron. The ‘b’ denotes the weight corresponding to hidden neuron bias input. Considering Eqs. (10) and (11), Eq. (12) can be rewritten as follows [26].

$$ H\beta = T $$
(12)

where H denotes the output matrix of the hidden layer, T denotes the training vector target matrix, and \( \beta \) denotes the weight matrix corresponding to the connection between hidden and output neurons. The H can be presented in matrix form as:

$$ H = \left[ {\begin{array}{*{20}c} {G\left( {a_{1 } ,b_{1} ,x_{1} } \right)} & \cdots & {G\left( {a_{l} ,b_{l} ,x_{l} } \right)} \\ \vdots & \ddots & \vdots \\ {G\left( {a_{1} ,b_{1} ,x_{N} } \right)} & \cdots & {G\left( {a_{l} ,b_{l} ,x_{N} } \right)} \\ \end{array} } \right]_{N \times l} ,\quad \beta = \left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ \vdots \\ {\beta_{l}^{T} } \\ \end{array} } \right]_{l \times m} ,\quad T = \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ \vdots \\ {t_{N}^{T} } \\ \end{array} } \right]_{N \times m} $$

Applying the concept of least square method, the output weights are computed as in (13).

$$ \beta^{\prime} = H^{ + } T $$
(13)

where \( H^{ + } \) represents the Moore–Penrose inverse [27] of the matrix H. The \( a\;{\text{and}}\;b \) can be generated randomly. For applying ELM for any application, the number of hidden neurons \( l \) and the activation function G need to be considered before training. The ELM is significantly faster compared to other contemporary neural networks due to fast evaluation of β in a single iteration.

3.1 Wild Goat Optimization

A novel swarm intelligent algorithm based on wild goats’ climbing nature is introduced by Shefaei et al. [28]. This wild goat optimization (WGO) algorithm is inspired by the living in groups, cooperation between members of groups, and movement by following the leader nature, to reach the highest point of the mountain. WGO in comparison with other evolutionary and classical optimization techniques has many important features that motivate to apply in this present study [28] and are as follows.

  • Simple in approach due to derivative-free computation.

  • Flexible to implement with other techniques to form hybrid approach.

  • Less sensitive to the objective function types, i.e., convexity or continuity.

  • Has less parameters on which the performance depends on.

  • Ability to escape and not to be trapped to local minima.

  • Simple to implement and coding in terms of basic mathematical and logic operations.

  • Can able to solve the objective functions with stochastic nature.

  • Final results are independent to the random initialization of solution chosen to start its iteration.

The mathematical expression with stepwise representation of WGO is briefly explained as follows.

  • Step 1 Initialization

Like other stochastic-based evolutionary algorithms, initially a population matrix \( {\text{wg}}_{ij} \) representing each row as one solution is randomly created.

$$ \left( {\text{wg}} \right)_{{N_{{{\text{wg}} \times }} N_{\text{var}} }} = \left[ {x_{ij} } \right] $$
(14)

where \( N_{\text{wg}} \) denotes the number of wild goats referred to the number of solutions of the problem, \( N_{\text{var}} \) denotes the dimension of the problem referred to as number of variables, \( i = 1,2, \ldots .,N_{\text{wg}} \), and \( j = 1,2, \ldots .,N_{\text{var}} \)

The fitness value of each \( N_{\text{wg}} \) is evaluated according to the objective function, which refers to the objective of the problem. This can be represented as follows:

$$ f\left( {{\text{wg}}_{i} } \right) = f\left( {\left[ {x_{ij} } \right]} \right) $$
(15)

The objective function values are normalized according to the electromagnetism (EM) algorithm as represented in (16) to assign a weight \( {\text{WT}} \) or each \( N_{\text{wg}} \) solution.

$$ {\text{WT}}_{i} = \exp \left( { - N_{\text{var}} \frac{{f\left( {{\text{wg}}_{i} - \hbox{min} \left\{ {f\left( {{\text{wg}}_{j} } \right)} \right\}} \right)}}{{\mathop \sum \nolimits_{j = 1}^{{N_{\text{wg}} }} \left( {f\left( {{\text{wg}}_{j} } \right) - \hbox{min} \left\{ {f\left( {{\text{wg}}_{j} } \right)} \right\}} \right)}}} \right) $$
(16)

where \( i,j = 1, \ldots ,N_{\text{wg}} \) and \( {\text{WT}} \) represents the weight of each \( {\text{wg}} \). According to the value of \( {\text{WT}} \) of each \( {\text{wg}} \), the number of leaders \( N_{\text{l}} \) and the number of followers \( N_{\text{f}} \) are decided according to the relation \( N_{\text{f}} = N_{\text{wg}} - N_{\text{l}} \). The number of groups formed \( N_{g} \) is equal to \( N_{\text{l}} \). The leaders are selected according to the concept of survival of the fittest and judged according to the weight assigned to the each \( {\text{wg}} \), i.e., the highest value of \( {\text{WT}} \). The follower member of each group is selected proportionally according to the weight of the leader in that group. Each group is initially formulated with one leader and number of followers as follows.

$$ \left\{ {\begin{array}{*{20}c} {{\text{GR}}_{1} = \left\{ {{\text{wg}}_{1} ,..,{\text{wg}}_{{N_{G1} }} } \right\}} \\ . \\ . \\ . \\ {{\text{GR}}_{{N_{\text{g}} }} = \left\{ {{\text{wg}}_{{N_{{G_{{N_{{{\text{g}} - 1}} }} }} }} + 1, \ldots ,{\text{wg}}_{{N_{j} }} } \right\}} \\ \end{array} } \right. $$
(17)

where \( {\text{GR}}_{i} \) indicates the \( i \)th group.

  • Step 2 Movement within groups

A movement vector \( vv \) is defined for each \( {\text{wg}} \) of wild goats. However, the movement value for leaders and followers is decided differently. For the leaders, the movement is decided according to the best leader weight value \( (P_{\text{lbest}} ) \) and its own weight value as represented in (18).

$$ vv_{i} \left( {t + 1} \right) = w \times vv_{i} \left( t \right) + R \times {\text{rand}} \times (P_{\text{lbest}} \left( t \right) - {\text{wg}}_{i} \left( t \right)) + c_{i} \left( t \right) $$
(18)

where \( i = 1,2, \ldots .,N_{\text{g}} \). The \( w \) and \( R \) are the inertia weight and personal learning coefficient, respectively. \( {\text{rand}} \) is a random number in between 0 and 1. \( P_{\text{lbest}} \) is the best leader among all leaders. The \( c \) and \( d \) are auxiliary parameters computed as follows.

$$ c_{i} \left( t \right) = \mathop \sum \limits_{i = 1}^{{N_{\text{g}} }} {\text{rand}}_{i} \times d_{i} \left( t \right) $$
(19)

where \( i = 1, \ldots ,N_{\text{g}} \)

The \( d \) is computed as:

$$ d_{i} \left( t \right) = \left\{ {\begin{array}{*{20}l} {W_{j} \left( t \right) \times \left( {{\text{wg}}_{j} \left( t \right) - {\text{wg}}_{i} \left( t \right)} \right);} \hfill & {{\text{if}}\;W_{i} \left( t \right) < W_{j} \left( t \right),\;j \ne i} \hfill \\ {0,} \hfill & {{\text{Otherwise}}} \hfill \\ \end{array} } \right. $$
(20)

where \( i,j = 1, \ldots ,N_{\text{g}} \). However, the movement of the followers is regulated based on the group leader and all other follower members of the group with higher weight. In other words, the movement of each \( {\text{wg}} \) within each group has three components. Its own movement vector, its best attempt, and the followers of this group are three major factors which affect the computation of movement vector. The followers movement within a group is computed as represented in (21).

$$ \begin{aligned} vv_{i} \left( {t + 1} \right) & = w \times vv_{i} \left( t \right) + R \times {\text{rand}} \times (P_{\text{lbest}} \left( t \right) - {\text{wg}}_{i} \left( t \right)) + W_{lk} \left( t \right) \times {\text{rand}} \\ & \quad \times \left( {{\text{wg}}_{lk} \left( t \right) - {\text{wg}}_{i} \left( t \right)} \right) + c_{i} \left( t \right) \\ \end{aligned} $$
(21)

Like (19) and (20) in case of movement vector calculation for the leader, the \( c \) and \( d \) auxiliary parameters for the followers movement vector are computed as follows.

$$ c_{i} \left( t \right) = \mathop \sum \limits_{i = 1}^{{N_{g} }} {\text{rand}}_{i} \times d_{i} \left( t \right) $$
(22)
$$ \begin{aligned} & d_{i} \left( t \right) = \left\{ {\begin{array}{*{20}l} {W_{j} \left( t \right) \times \left( {{\text{wg}}_{j} \left( t \right) - {\text{wg}}_{i} \left( t \right)} \right);\quad {\text{if}}\;W_{i} \left( t \right) < W_{j} \left( t \right),\;j \ne i} \hfill \\ {0,\quad {\text{Otherwise}}} \hfill \\ \end{array} } \right. \\ & \quad \quad \quad \quad i,j = 1, \ldots ,N_{g} \\ \end{aligned} $$
(23)

where k denotes the index of the groups. The new position of each \( {\text{wg}} \) is computed by adding the movement vector with its present position as follows.

$$ {\text{wg}}_{i} \left( {t + 1} \right) = {\text{wg}}_{i} \left( t \right) + {\text{vv}}_{i} \left( {t + 1} \right), i = 1, \ldots ,N_{\text{wg}} $$
(24)
  • Step 3 Reevaluating

After computing the new movement vector and position for all \( {\text{wg}} \) including leaders and followers, the step 1 to step 2 are iteratively followed. In the new position of \( {\text{wg}} \) signifying new solution in search space, a leader can be degraded to followers and also the follower may be the leader.

  • Step 4 Groups Cooperation and Mutation of Young Goats

The mutual influence of each groups’ experience in terms of knowledge sharing is considered by modifying the weight of each group as illustrated in (25).

$$ {\text{WT}}_{{{\text{G}}_{i} }} \left( t \right) = \frac{{{\text{WT}}_{{l_{i} }} \left( t \right) + \mathop \sum \nolimits_{j = 1}^{{N_{{G_{i} }} }} {\text{WT}}_{j} \left( t \right)}}{{1 + N_{{G_{i} }} }} , i = 1, \ldots ,N_{\text{g}} $$
(25)

According to that group weight value \( {\text{WT}}_{\text{G}} \), one group may attract the followers from other group if its \( {\text{WT}}_{\text{G}} \) value of that group is higher than the other groups. Similarly, one group may lose the followers if its \( {\text{WT}}_{\text{G}} \) value is lower than the other groups. This idea of WGO is presented in (26).

$$ G_{i} = \left\{ {\begin{array}{*{20}l} {G_{i} + \left\{ {{\text{wg}}_{k} \left| {k \in j,\mathop {\hbox{min} }\limits_{{{\text{wg}} \in j}} \left\{ {\text{WT}} \right\} = {\text{WT}}_{k} ;\quad {\text{if}}\;{\text{WT}}_{{{\text{G}}_{i} }} \left( t \right)} \right\rangle {\text{WT}}_{{{\text{G}}_{j} }} \left( t \right),\quad j \ne i} \right\}} \hfill \\ {0,\quad {\text{Otherwise}}} \hfill \\ \end{array} } \right. $$
(26)

where \( i,j = 1, \ldots ,N_{\text{g}} \). A mutation operator is adopted to escape from premature convergence. The mutation percentage m is assumed such that the number of mutated wild goats in every iteration is less than the number of groups. The mutation operator can be considered in evaluating the new position of \( {\text{wg}} \) as presented in (14).

$$ {\text{wg}}_{i} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {{\text{wg}}_{i} \left( t \right) + {\text{rand}} \times \left( {{\text{rand}} - {\text{wg}}_{i} \left( t \right)} \right);\quad {\text{if}}\quad {\text{rand}} \times \frac{1}{{{\text{WT}}_{{{\text{G}}_{i} }} \left( t \right)}} \times \frac{1}{{{\text{WT}}_{j} \left( t \right)}} > m} \hfill \\ {0, \quad {\text{Otherwise}}} \hfill \\ \end{array} } \right. $$
(27)

where \( i = 1, \ldots ,N_{\text{g}} ;\quad j = 1, \ldots ,N_{{{\text{G}}_{i} }} \)

  • Step 5 Moving Toward Reaching One Group

Maximum number of iterations is taken as the stopping criteria in this algorithm. At last the best group leader is considered as the one to reach the highest point referred to the best solution for the optimization problem.

3.2 Modified Adaptive Mutation Strategy

For any mutation-based evolutionary optimization technique, adaptively changing mutation factor is always fruitful to enhance the convergence property and find better optimum accuracy. Furthermore, the mutation strategy variation balances both the advantages of exploration and exploitation of the search space. The proposed time-varying mutation process significantly reduces the chance of not arriving at global optimum value, particularly in the high-dimensional problem application. This also makes the searching faster. The idea of proposing adaptive mutation strategy is motivated from the fact that the constant mutation factor may be good for exploration and later may act poorly for exploitation stage of searching. The time-varying mutation strategy can be expressed as:

$$ m = 1 - m^{\prime} $$
(28)

where \( m^{\prime} \) is the ratio of the current-generation iteration number to the maximum iteration number. Computationally with this strategy, the balance of exploration and exploitation will be executed for an enhanced optimal solution. Higher values initially give more weight to exploration searching movement, and lower value in later stage accelerates the exploitation searching movement.

3.3 Evolutionary ELM Model

The ELM network is very often found to perform better in many applications due to selecting many parameters randomly or hit and trial and approaches like input weights, hidden biases, and number of hidden neurons [29, 30]. In this proposed approach, the WGO algorithm is applied to tune the input weights and hidden neuron biases. In this hybrid approach, initialization of the solution space contains all input weights and hidden biases as represented in (29).

$$ {\text{wg}}_{i} = \left[ {w_{i1} ,w_{i2} , \ldots ..w_{iL} ,b_{i1} ,b_{i2} , \ldots .b_{iL} } \right] $$
(29)

where the solution vector has the dimension of \( D = \left( {d + 1} \right) \times L \). All \( w_{ij} \) and \( b_{j} \) values are assumed randomly within [− 1, 1] range. After initialization, the parameters of ELM are tuned by evaluating output weights for each solution by using (13). Like for any classification problem, the major objective for ELM parameter tuning is to reduce the number of misclassified patterns. The ith solution fitness value in solution space is calculated according to Eq. (30).

$$ {\text{fitness}}_{i} = \frac{1}{{1 + {\text{MP}}_{i} }} $$
(30)

where \( {\text{MP}}_{i} \) denotes the total number of misclassified patterns acquired for ith solution. This fitness value needs to be maximized for better solution performance. However, the third factor which affects the performance of the ELM classifier is the number of hidden neurons. In this study, the number of hidden neurons is gradually increased and one of the best is considered for testing phase. It has been found after a minimum number of hidden neurons, and further increase does not give any substantial improvement in the result. It may be due to saturation or over-fitting characteristics of the network.

3.4 Proposed ST–ELM-Based PQ Event Recognition Approach

The proposed approach is based on two major techniques: ST and ELM. The ST is used for computing the feature extraction vector from various PQ signals. From the STA matrix obtained as output from the S-transform, the features are computed by applying various statistical techniques. The performance of the ELM is further enhanced by optimally computing the parameters for enhancing the capability as a classifier. The WG optimization technique is used for obtaining the optimal values of the ELM parameter. This hybrid approach is named as WGO-based ELM (WGOELM) classifier. By taking those feature vectors, WGOELM is used to detect and classify according to the characteristics interpreted through various features. The detail process of feature extraction stage by using S-transform is presented in Sect. 3.5. Section 3.6 details about the classification stage by using the WGOELM.

3.5 Feature Extraction Stage Using S-Transform

The ST of power quality event (PQE) signals can be pictorially visualized in terms of time–frequency, time–amplitude, and amplitude–frequency. Looking into the real-time conditions, a wide range of PQEs is considered for this study. The PQEs are named as various classes CL1, CL2, CL3, CL4, CL5, CL6, CL7, CL8, and CL9 for pure signal, voltage sag, voltage swell, interruption, flicker, oscillatory transient, harmonics, sag with harmonics, and swell with harmonics, respectively.

Figures 2, 3, 4, 5, 6, 7, 8, 9, and 10 demonstrate the standard signal along with other PQE signals undertaken for this study. Along with that, time–frequency, time–amplitude, and frequency–amplitude plots are shown, derived from the S-transform amplitude matrix (STA). The ideal standard voltage signal is shown in Fig. 2. The disturbance signals indicating the poor power quality of normal signal in various ways are presented in Figs. 3, 4, 5, 6, 7, 8, 9, and 10a. Time–frequency contours (TF contour) for different PQE cases are demonstrated in Figs. 2, 3, 4, 5, 6, 7, 8, 9, and 10b indicating the normal frequency versus time of STA. Time–maximum amplitude plots (TmA plots) are presented for various cases in Figs. 2, 3, 4, 5, 6, 7, 8, 9, and 10c, indicating the maximum amplitude versus time retrieving from the STA matrix columns at the fundamental frequency. The STA at the fundamental frequency is also shown in the TmA plot. The amplitude of a standard voltage signal is constant, and it is depicted in Fig. 2c. The frequency–maximum amplitude plots (FmA plots) for various PQE cases are demonstrated in Figs. 2, 3, 4, 5, 6, 7, 8, 9, and 10d, indicating the variation in maximum amplitude versus normalized frequency derived from the STA matrix rows at different frequencies. The normalization of frequency amplitude (f) has been done by f/fs, where fs denotes the sampling frequency. The information about the disturbance frequency components and their corresponding maximum amplitude is shown in all the figures in terms of pictorial representation. The presence of only one peak in the FmA plot indicates the existence of a fundamental frequency component in the signal. The frequency–standard deviation plots (Fstd plots) are shown in Figs. 2, 3, 4, 5, 6, 7, 8, 9, and 10e, indicating the standard deviation versus normalized frequency, and it is derived from the rows of the STA matrix at all frequencies.

Fig. 2
figure 2

Normal voltage

Fig. 3
figure 3

Voltage sag

Fig. 4
figure 4

Voltage swell

Fig. 5
figure 5

Voltage interruption

Fig. 6
figure 6

Voltage flicker

Fig. 7
figure 7

Voltage transient

Fig. 8
figure 8

Voltage harmonics

Fig. 9
figure 9

Voltage sag with harmonics

Fig. 10
figure 10

Voltage swell with harmonics

Features are extracted through the application of standard statistical techniques to the signal data as presented in contours and various corresponding PQE plots of the STA matrix. These features characterize the PQE signals and are essential for detection and classification by finding and quantifying the related significant parameters of the different signals [31]. Different possible features are extracted for all types of PQEs and are represented as follows.

  • Type-1 TF contour Features

The major features are extracted named as Fp1, Fp2, and Fp3, indicating standard deviation having the largest frequency amplitude, mean of contour having the largest frequency amplitude, and energy having the largest frequency amplitude of TF contour, respectively.

  • Type 2 TmA plot Features

The major features are extracted in this category named as Fp4, Fp5, Fp6, Fp7, Fp8, Fp9, Fp10, Fp11, and Fp12, from the TmA plot. The features like maximum of TmA plot, minimum of TmA plot, mean of TmA plot, and standard deviation of TmA plot features are denoted as Fp4, Fp5, Fp6, and Fp7, respectively. The amplitude factor of \( {\text{Apf}} \) is denoted as Fp8 feature. The \( {\text{Apf}} \) is computed as:

$$ {\text{Apf}} = \frac{1}{{F_{pN} }}\left( {1 + \left( {Fp4 + Fp5 - \left( {F_{pN} } \right)} \right)} \right) $$
(31)

where \( F_{pN} \) is Fp4 + Fp5. This is for a normal voltage signal (undisturbed case).

The other features Fp9, Fp10, Fp11, and Fp12 denote standard deviation of \( P_{r} \), (max \( \left( {P_{r} } \right) \), min \( \left( {P_{r} } \right) \) skewness of \( P_{r} \), and kurtosis of \( P_{r} \), respectively, of the TmA plot. The \( P_{r} \) denotes the TmA plot for frequencies larger than the four times of the fundamental frequency.

  • Type 3 FmA plot Features

The total harmonic distortion (THD) of FmA plot is considered as one of the features named as Fp13. The THD can be expressed as:

$$ {\text{THD}} = \frac{{\sqrt {\mathop \sum \nolimits_{n = 2}^{N} V_{n}^{2} } }}{{V_{1} }} $$
(32)

where N denotes the number of points considered in the FFT analysis.

  • Type 4 Fstd plot Features

The mean of square root (Msr) of the Fstd plot is considered as another feature named as Fp14. Mathematically, the Msr can be computed as in Eq. (33).

$$ M_{\text{sr}} = {\text{mean}}\left( {\sqrt {{\text{std}}\left( {\text{STA}} \right)} } \right) $$
(33)

The mathematical formula for the statistical parameters is given as follows:

$$ {\text{Mean}},\;\bar{x} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} x_{i} $$
(34)
$$ {\text{Standard}}\;{\text{deviation}},\sigma = \sqrt {\left( {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - \bar{x}} \right)^{2} } \right)} $$
(35)
$$ {\text{Skewness}},\;\sigma_{\text{skew}} = \frac{1}{{\left( {N - 1} \right)\sigma^{3} }}\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - \bar{x}} \right)^{3} $$
(36)
$$ {\text{Kurtosis}},\;\sigma_{\text{kurt}} = \frac{1}{{\left( {N - 1} \right)\sigma^{4} }}\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - \bar{x}} \right)^{4} $$
(37)

where \( x_{i} \) denotes the considered discrete variables and N denotes the number of data points.

3.6 Classification Phase

The proposed WGOELM approach is applied and tested as a classifier on the PQE signals by considering their extracted features. The WGOELM-based classifier is found to be very fast due to lack of iterative approach of ELM and higher generalization ability due to fine tuning for optimal parameter setting. The obtained features of all set of PQEs through the S-transform are used as the possible set for classification by WGOELM. The classifier with these feature vector data sets detects and classifies the power quality events. The performance and accuracy of the proposed approach have been tested and presented in the Results section.

4 Results

To access the performance of the proposed ST–WGOELM-based PQ event detection technique, four different classification procedures are followed. For testing the proposed approach, synthetic event data are collected along with noisy and hybrid PQ events. The different parameters for WGO are considered as \( N_{\text{wg}} = 100, N_{\text{g}} = 10 \) in initialization stage, \( w = 0.7298 \), R = 1.4962, m = 0.95 (for conventional WGO). For the proposed work, the mutation strategy is made time-varying according to the suggested mechanism in Sect. 3.2. To match the real-time conditions, the noisy data have been collected with Gaussian white noise of 20, 30, and 50 dB signal–noise rates (SNR) as noise added to the PQ event data.

The SNR can be represented as:

$$ {\text{SNR}} = 10{ \log }\left( {\frac{{P_{s} }}{{P_{n} }}} \right){\text{dB}} $$
(38)

where \( P_{\text{s}} \) and \( P_{\text{n}} \) denote the power of the signal (variance) and noise, respectively.

Apart from that as very often in real-time conditions, PQ events may occur at the same time, mixed PQ event’s data are also collected through MATLAB simulation. After extraction of wide range and different types of PQ event signals, ST transform is applied to compute the distinctive features characterizing the signals. These features are considered as the input to the WGOELM-based classifier to detect accurately the type of PQ event.

The detection and classification accuracy of the proposed approach has been tested on 900 synthetic signals of power quality events. The synthetic signals are extracted through MATLAB programming and simulation. Considering the Shannon’s sampling theorem, the sampling frequency is taken as 3.2 kHz in this study which is greater than twice the highest frequency of the signal (960 Hz) [32, 33]. The simulation sample time (step size) is 0.3125 ms. Referring to the standard percentage, 60% and 40% of the data set are considered for the training and testing, respectively.

4.1 Classification Process 1

The number of neurons of a hidden layer is considered as 30, and this is considered by taking the least value of neurons for which the maximum accuracy is achieved. The activation function for the neurons is considered as sigmoidal function due to its higher accuracy rate in many engineering applications. In Tables 1 and 2, the detailed performance evaluation is shown by using training and testing data. The overall accuracy of the proposed approach is 99.68% in comparison with ST with ELM-based approach [32] as 99.19%.

Table 1 Classification process 1 results
Table 2 Confusion matrix

4.2 Classification Process 2

PQ event signals are added with a noise factor of 20 dB to synthetic data for testing similar to real-time conditions. According to the standard percentage taken in many papers, 60% and 40% of the data are considered for training and testing. The performance results of the proposed WGOELM-based classifier are shown in Tables 3 and 4. The results indicate a very high performance of the proposed WGOELM- and S-transform-based detection and classification technique under this condition. As like before, in this case also with similar conditions, the overall accuracy of the proposed approach is 99.68% in comparison with ST with ELM-based approach [32] as 99.67%.

Table 3 Classification process 2 results
Table 4 Confusion matrix

4.3 Classification Process 3

PQ event signals are added with a noise factor of 30 dB to synthetic data for testing similar to real-time conditions. The results are presented in Tables 5 and 6. The results reflect a very high accuracy of the proposed WGOELM-based recognition technique under this condition. Due to tuning of ELM parameters by WGO, the overall accuracy of the proposed approach is 99.84% in comparison with ST with ELM-based approach [32] as 99.67%.

Table 5 Classification process 3 results
Table 6 Confusion matrix

4.4 Classification Process 4

In this process, the performance evaluation of the proposed approach has been done with 50 dB noise added to the PQEs signals. The results are presented in Tables 7 and 8. Similarly, 30 numbers of hidden neurons and the same sigmoidal activation function are taken as like other cases. The results obtained are better even under noisy conditions. The overall accuracy of the proposed approach is 100% in comparison with ST with ELM-based approach [32] as 100%. This is due to enhancing the classification ability of the ELM by WGO technique.

Table 7 Classification process 2 results
Table 8 Confusion matrix

4.5 Performance Evaluation and Discussion

This study devotes to find a better approach to recognize the power quality disturbances. Evaluating the results of the proposed approach based on S-transform- and wild goat optimization-based extreme learning machine seems to be acceptable for real-time application due to its high performance under different noisy conditions and for various power quality events. The process 1 is tested under normal PQEs considering all power quality conditions. The results justify the capability of recognition and detection of PQ-based signals. The processes 2, 3, and 4 tested the proposed approach under 20 dB, 30 dB, and 50 dB noise distortion, respectively, looking for the similar real-time PQ signals under various nonlinear loads and nonlinear devices presence. The results reflect the robustness of the proposed WGOELM-based recognition system for power quality events under normal and noisy conditions. In all cases, the proposed approach arrives at a better result in comparison with the ST with ELM approach [32]. A comparison of results based on classification accuracy of the proposed approach with other existing methods is presented in Table 9. The results reveal that the proposed approach fairly gives better accuracy in comparison with other ST-based techniques.

Table 9 Classification accuracy comparison

5 Conclusion

A novel ST–WGOELM-based approach is suggested for the automatic recognition of PQ events. By segmentation and normalization of PQ event signals through ST analysis, distinctive and prominent feature vectors are obtained as a characterization of the different types of signals. The obtained feature vectors are used as input to the WGO–ELM-based classifier to detect the classes these events belong to. To improve the performance of the ELM-based classifier, WGO technique is applied to tune the ELM parameters like network input weights and hidden biases. To justify the approach as an initial study toward the real-time application, it is tested to the similar kind of synthetic individual and hybrid data set along with three different noisy conditions (20, 30, and 50 dB).

The major findings of the study for the proposed novel approach for PQ event detection toward its possibility of real-time applications are: (1) The performance of the ELM classifier is enhanced by tuning its weight and bias parameter through a simple and robust optimization technique based on WG algorithm. (2) Optimally tuned ELM classifier computationally takes less time and can be preferred over other neural network techniques in terms of speed. (3) ST- and WGO–ELM-based hybrid approach arrives at better results as compared to other approaches suggested by various researchers in recent times. (4) Similar real-time conditions are carefully tested through noisy and hybrid PQ event data.

About the futuristic addition to the present study, it can be further improved by using the modified ST technique, by considering inverter and without inverter-based renewable energy sources, storage devices, and electric vehicle-integrated micro-grid test system. There are many challenges need to be focused in future for PQ research in real-time detection and classification. Although many research studies have been done in recent past based on the single-phase data, there is an urgency to focus on real-time three-phase power quality disturbances. Apart from that, new indicators are to be formulated to characterize particularly the transient signals and to extract better relevant features which in turn affect to a simple and improved classification. In addition to that, new methodologies are to be proposed to include noise immunity, the presence of simultaneously occurred multiple events and blackout conditions due to PQ events.