1 Introduction

Today’s structural engineering industry must direct attention toward structural health monitoring (SHM) and the optimization of safety. With forecasts of an increasing world population, structural infrastructure will be subject to increased loading and deformation. To reduce the effects and consequences of structural deterioration, SHM must be performed more frequently and with high accuracy to achieve asset preservation. Hence, there has been a surge of interest in SHM and the development of automated defect evaluation systems, aimed at maintaining existing structural networks and enabling asset expansion.

Concerning structural behavior, damage leads to deviations in the structure’s dynamic characteristics and is considered a reliable indicator for anomaly diagnosis. Damage may also cause a system with typically linear behavior to exhibit nonlinear responses, including cracking, impacts and rattling, delamination, stick or slip, rub, or deformation in connections [1, 2]. Nonlinear behavior is less predictable and more complicated than linear behavior. As a case in point, experimental investigation has shown that natural frequencies can rise instead of decrease in the crack-breathing phenomenon [3]; this reaction originates from the crack alternately opening and closing during the test. Consequently, the detection of nonlinear anomalies is considered more challenging than that of linear damage [4].

Over the decades, researchers have proposed several techniques for anomaly identification. Generally speaking, such methods are divided into physics-based (or model-based) and data-driven approaches [5]. In physics-based methods, anomalies are tracked by monitoring variations in the simulated responses of a numerical model of the structure [6]. This model is a detailed mathematical abstraction linking the input and output variables of the studied system through known or presumed properties [7]. Post-analysis is required to determine damage location and quantification. Finite-element methods (FEMs), boundary element methods (BEMs), and spectral finite-element methods (SFEMs) are some of the techniques used in this regard; FEMs are considered the most systematic of these because of their flexibility in modeling complicated structures [8]. When damage occurs, particular parameters of the simulated models are updated according to response measurements. Optimization algorithms are typically used to minimize the discrepancy between experimental and numerical responses by comparing mechanical characteristics such as stiffness, damping, or mass [6].

Despite the broad potential of physics-based approaches in damage assessment, especially for complex systems such as multi-story buildings and multi-span bridges, they have some limitations. For example, exact modeling of a structure requires sufficient information about the different components of the monitored system, such as loading states, boundary conditions, material properties, and the precise coordinates of members. Moreover, the optimization solutions commonly suffer from numerical instability and ill-conditioning [9], and the performance of such optimization techniques degrades substantially as the number of variables in the problem grows.

On the other hand, data-driven SHM provides bottom–up solutions founded on tracking changes in the output signals, which is appropriate for complex systems where knowledge about geometries, properties, and initial conditions is limited [5]. Any sudden changes in the output signals are observed and analyzed through signal processing tools and pattern recognition procedures to determine probable damage. Independence from an initial model and prior knowledge makes data-driven SHM a faster technique and an economical, practical solution for online SHM. Signal processing techniques synthesize, modify, and analyze the recorded responses and highlight different features in the time, frequency, and time–frequency domains. Machine learning algorithms are typically employed in conjunction with such methods to identify and interpret the features extracted from signals and to recognize the generated patterns. Machine learning includes clustering, regression, neural networks, ensemble learning, deep learning, Bayesian methods, instance-based methods, decision trees, and dimensionality reduction [10].

Data-driven methods are preferable to physics-based techniques when [11]: first, the structure’s physical characteristics are unavailable or difficult to model; second, an adequate number of sensors is installed for capturing the structure’s responses; third, the computational operations of the SHM project are costly; in addition, multi-physics models comprising several physical processes in a system (e.g., thermal interactions, water precipitation, and magnetostatic and chemical reactions) may not be efficient in exploiting a large amount of sensor data. The accuracy of physics-based methods depends on the response measurements; the best performance is achieved in an environment with the least noise. In real-world structures, however, and especially under in-service conditions, the amount of noise is considerable. As such, data-driven damage identification deployed on actual responses has revealed preferable adaptability and has thereby become an inspiring solution in the realm of SHM [10].

1.1 Need for research

Although nonlinear damage has been studied before and practical solutions have been proposed in this realm, most focus on damage identification, the first level of Rytter’s classification of SHM [12]. Hence, limited research has been conducted to reach higher levels (e.g., damage localization and classification). This study attempts to address nonlinear damage detection in building structures through a robust data-driven approach. Adverse conditions, such as environmental and operational effects on recording responses and analyzing signals, are further crucial points that should be considered. These issues become more challenging in the case of buildings, where story correlations can affect the structural responses. Therefore, a robust model with appropriate precision in identifying different kinds of linear and nonlinear anomalies under these conditions constitutes a practical approach for assessing real-world structures.

Accordingly, the rest of the paper is organized as follows. In Sect. 2, related works are discussed, and gaps are highlighted once again. Case studies are presented in detail in Sect. 3. Section 4 provides the details of the proposed data-driven approach. Experimental results and discussion are given in Sect. 5, and the suitability of the GARCH model is assessed in Sect. 6. Finally, Sect. 7 concludes the work and suggests future directions.

2 Background

Signal processing techniques play a fundamental role in data-driven SHM for analyzing responses in the time, frequency, or time–frequency domains. Fourier spectra, spectrum analysis, difference frequency analysis, and the high-frequency resonance technique are appropriate for damage identification, especially for gear faults and roller bearings [13]. Wavelets have proved efficient for damage and deterioration detection in building structures based on a stochastic approach [14]. The Fourier transform (FT) and fast Fourier transform (FFT) are considered the main concepts for anomaly detection. A time-series model is a promising tool for simulating and predicting structural signals in the time domain; since this approach is based on a partial model of the structural dynamics, it can identify even small vibrations [15]. In this area, autoregressive (AR) models have been investigated for damage and deterioration detection in buildings and bridges [16–18]. The autoregressive moving average (ARMA) model, as well as the generalized autoregressive conditional heteroscedasticity (GARCH) model, has proved beneficial for nonlinear damage identification in building specimens [19]. Transient behaviors caused by damage or adverse environmental conditions can be recognized through a signal’s time–frequency form [20].

From a broad perspective, real-world signals are nonlinear and nonstationary and are coupled with noise. Consequently, linear signal processing techniques, such as spectral analysis, are not appropriate in this scope [21]. The Hilbert–Huang transform (HHT), introduced by Huang et al. [22], consists of two sequential steps. The first step, called empirical mode decomposition (EMD), separates the complicated initial signal into a determined and commonly limited number of intrinsic mode functions (IMFs), or modes, plus a residue. Each mode is an oscillatory function with time-varying frequencies that reveals the local features of the input signal and corresponds to a different frequency band [23, 24]. The algorithm recursively detects the maxima/minima, estimates the envelopes from these extrema, and subtracts the average envelope, which isolates the high-frequency bands [25]. In the next step, the Hilbert transform (HT) produces each IMF’s orthogonal pair with a 90° phase difference [26]. As a result, each IMF and its corresponding pair can evaluate instantaneous variations of signal magnitude and frequency over time. Compared to wavelet analysis and the Fourier transform, EMD benefits from tracing out the IMFs by interpolating between the extrema instead of using any given wavelet basis. Despite the wide usage of EMD in a variety of time–frequency applications, such as medicine [27], economics [28], climate prediction [29], SHM [30, 31], and many other fields, it may face issues such as sensitivity to noise and sampling frequency, which make its performance rely on the frequency ratio [25, 31, 32].

Some modified algorithms have been developed to address these limitations, including ensemble EMD (EEMD), complete ensemble EMD with adaptive noise (CEEMDAN), and variational mode decomposition (VMD) [32]. VMD is a relatively new algorithm that decomposes a signal into distinct amplitude- and frequency-modulated sub-signals that together reproduce the primary input signal [32]. This approach is entirely non-recursive, and the sub-signals are extracted simultaneously; VMD has been shown to outperform the EMD algorithm in various areas, such as signal analysis and damage detection.

Variational mode decomposition has been deployed in real SHM applications by several researchers. For instance, Bagheri et al. [31] calculated damping ratios for each modal response extracted by VMD; the mode shape vector was obtained for each decomposed structural mode and then used for damage identification in three specimens, comprising numerical, experimental, and field case studies. Xin et al. [33] established two damage indices relying on modal parameters obtained from VMD; experimental and numerical assessments demonstrated the efficiency of the method in finding the location and severity of nonlinear damage scenarios in the models. Das and Saha [34] investigated the impact of a heavy-noise environment on a new hybrid algorithm combining VMD with frequency-domain decomposition (FDD); it was deduced that the hybrid method could detect damage location accurately for noise levels above 20%. In the following sections, a novel methodology is illustrated and assessed on two experimental specimens with linear and nonlinear damage scenarios.

3 Case studies

In this section, two case studies used in this work are thoroughly explained and discussed.

3.1 Case study 1: linear damage

The first case study is a three-story metal frame with aluminum columns and floors, investigated for linear damage simulation [35]. A roller at the base supports the specimen, which can be moved horizontally by a hydraulic jack. Each floor is instrumented with a piezoelectric single-axis accelerometer. Nine structural states are imitated through stiffness reduction of columns and the addition of a 1.2 kg mass, and 50 signals are recorded for each state with a sampling rate of 320 Hz. Therefore, 450 signals are acquired over all scenarios, as illustrated in Table 1. As depicted, there are nine states: the healthy condition (S1), representing the intact structure without any changes in components; two scenarios simulating operational and environmental effects by changing the mass of floors (S2 and S3); and six damage scenarios obtained by changing the stiffness of columns (S4–S9).

Table 1 Damage scenarios of case study 1 [17]

Additionally, Fig. 1 presents sample recorded signals in different scenarios, where \(y_{1} (t)\), \(y_{2} (t)\), and \(y_{3} (t)\) represent the data recorded by sensors 1, 2, and 3, respectively. It is evident that the recorded responses for all damage scenarios follow a random pattern, and time-domain data alone cannot discriminate damaged states from healthy cases. Thus, the output responses need to be modeled through signal processing techniques to find suitable features indicating variations in the signals.

Fig. 1 Samples from some scenarios in linear damage (scenario 1)

3.2 Case study 2: nonlinear damage

This case is an adjusted version of the first case study and is used for studying the impact of nonlinear damage. The sampling rate is close to that of the linear model, set here to 322.58 Hz, with 8192 data points for each record; ten measurements are recorded for each state. As in the initial specimen, this frame glides on rails that enable translation in one direction with the aid of an actuator. Four accelerometers with a sensitivity of 1000 mV/g are attached on the side opposite the shaker at the center of the floors; thus, they cannot capture the specimen’s torsional modes.

To simulate nonlinear damage, a mechanical bumper and a center column are installed on the frame. This mechanism imitates a breathing crack and causes nonlinear behavior when the installed column hits the bumper, which is placed on the second floor. The adjustable gap between the bumper and the installed column defines different degrees of nonlinearity: the larger the gap, the weaker the nonlinear behavior. The specimen’s outline and the damage scenarios are provided in Fig. 2 and Table 2, respectively. Some recorded nonlinear signals are given in Fig. 3, where \(y_{1} (t)\), \(y_{2} (t)\), \(y_{3} (t)\), and \(y_{4} (t)\) represent the data recorded by sensors 1, 2, 3, and 4, respectively. Similar to the previous case, the time-domain presentation of the responses cannot properly indicate variations due to damage.

Fig. 2 Three-story bookshelf (adapted from [14])

Table 2 Damage scenarios of case study 2 [17, 35]

Fig. 3 Samples from nonlinear scenarios in case study 2

As noted, two three-story models were presented for linear and nonlinear damage scenarios. Linear damage was simulated by reducing the cross-sectional area of columns, while nonlinear behavior was produced by a mid-column hitting a bumper in the second case study. Environmental and operational conditions were also considered by adding a mass in different damage scenarios. Story accelerations were recorded for damage identification and classification using the novel methodology discussed in the following section.

4 Proposed method

In this work, anomaly detection is performed in three steps. First, VMD decomposes the signal into several sub-signals with separated bandwidths. Second, primary features are extracted using time-series modeling, and the number of features is then reduced by KPCA and KDA. Finally, three supervised classifiers are separately deployed to discriminate different damage states within the specimens. A schematic workflow of the proposed method is depicted in Fig. 4. In the following, these stages are described thoroughly.

Fig. 4 Workflow of the proposed method

4.1 Signal processing

Herein, the input acceleration signals are decomposed using VMD, so that an input signal \(S(t)\) is broken down into \(d\) band-limited IMFs of the form [36]

$$u_{k} (t) = A_{k} (t)\cos (\phi_{k} (t)),$$
(1)

where \(A_{k} (t)\) denotes the instantaneous amplitude of \(u_{k} (t)\) and \(\omega_{k} (t) = \phi_{k}^{\prime } (t)\) its instantaneous frequency. The constrained variational problem, constructed using the Hilbert transform, is as follows:

$$\mathop {\min }\limits_{{\{ u_{k} \} ,\,\{ \omega_{k} \} }} \left\{ {\sum\limits_{k} {\left\| {\partial_{t} \left[ {\left( {\delta (t) + \frac{j}{\pi t}} \right) * u_{k} (t)} \right]e^{{ - j\omega_{k} t}} } \right\|}_{2}^{2} } \right\},$$
(2)

such that

$$S(t) = \sum\limits_{k} {u_{k} (t)} ,$$
(3)

where \(\partial_{t}\) denotes the partial derivative with respect to \(t\), \(\delta (t)\) is the Dirac delta function, \(*\) denotes convolution, and \(\{ u_{k} (t)\} = \{ u_{1} (t), \ldots ,u_{d} (t)\}\) and \(\{ \omega_{k} \} = \{ \omega_{1} , \ldots ,\omega_{d} \}\) are the IMFs of the signal \(S(t)\) and the center frequencies of the corresponding sub-bands, respectively. To solve the optimization problem, Eq. (2) is recast as an augmented Lagrange function using \(\lambda\) as a multiplier operator and \(\alpha\) as a penalty factor:

$$L\left( {\{ u_{k} \} ,\{ \omega_{k} \} ,\lambda } \right) = \alpha \sum\limits_{k} {\left\| {\partial_{t} \left[ {\left( {\delta (t) + \frac{j}{\pi t}} \right) * u_{k} (t)} \right]e^{{ - j\omega_{k} t}} } \right\|}_{2}^{2} + \left\| {S(t) - \sum\limits_{k} {u_{k} (t)} } \right\|_{2}^{2} + \left\langle {\lambda (t),S(t) - \sum\limits_{k} {u_{k} (t)} } \right\rangle .$$
(4)

Afterward, Eq. (4) is transformed into the frequency domain, and the equivalent extremum problem is solved to obtain the frequency-domain form \(\hat{u}_{k} (\omega )\) of each modal element as well as the center frequency \(\omega_{k}\)

$$\hat{u}_{k}^{n + 1} (\omega ) = \frac{{\hat{S}(\omega ) - \sum\nolimits_{i \ne k} {\hat{u}_{i} (\omega )} + \frac{{\hat{\lambda }(\omega )}}{2}}}{{1 + 2\alpha (\omega - \omega_{k} )^{2} }},$$
(5)
$$\omega_{k}^{n + 1} = \frac{{\int_{0}^{\infty } {\omega \left| {\hat{u}_{k} (\omega )} \right|^{2} {\text{d}}\omega } }}{{\int_{0}^{\infty } {\left| {\hat{u}_{k} (\omega )} \right|^{2} {\text{d}}\omega } }}.$$
(6)

Finally, the alternating direction method of multipliers (ADMM) is deployed to solve the constrained variational model. Subsequently, the initial signal \(S(t)\) is broken down into \(d\) IMFs as described in the following:

  • Initialize the parameters \(\{ u_{k}^{1} \} ,\{ \omega_{k}^{1} \} ,\lambda^{1}\), and set \(n = 0\).

  • The values of \(u_{k}^{n + 1}\) and \(\omega_{k}^{n + 1}\) are updated according to Eqs. (5) and (6).

  • The multiplier \(\lambda^{n + 1}\) is updated as

    $$\hat{\lambda }^{n + 1} (\omega ) = \hat{\lambda }^{n} (\omega ) + \tau \left( {\hat{S}(\omega ) - \sum\limits_{k} {\hat{u}_{k}^{n + 1} (\omega )} } \right).$$
    (7)
  • Steps 2 and 3 are repeated until the following criterion is satisfied:

    $$\sum\limits_{k} {\frac{{\left\| {\hat{u}_{k}^{n + 1} - \hat{u}_{k}^{n} } \right\|_{2}^{2} }}{{\left\| {\hat{u}_{k}^{n} } \right\|_{2}^{2} }}} < \varepsilon .$$
    (8)

If the above condition is met, the iteration stops; otherwise, the procedure returns to the second step. In this way, the \(d\) IMFs are extracted [31, 36]. In Figs. 5, 6, 7, and 8, the IMFs of linear and nonlinear signals are shown; due to space limitations, only two IMFs are presented for each case.
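To make the iteration concrete, the following is a minimal NumPy sketch of the frequency-domain updates in Eqs. (5)–(8). It is a simplified illustration, not the authors' implementation: it works on the one-sided spectrum, initializes the center frequencies on a uniform grid, and omits the signal mirroring used in the original VMD formulation; the parameter values (alpha, tau, tol) are assumptions for demonstration. A ready-made package such as vmdpy can serve the same purpose in practice.

```python
import numpy as np

def vmd(signal, d=4, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=500):
    """Minimal VMD sketch: iterate the mode update (Eq. 5), the center-frequency
    update (Eq. 6), and the dual ascent on lambda (Eq. 7) until Eq. (8) holds."""
    T = len(signal)
    freqs = np.fft.rfftfreq(T)                  # normalized frequency axis
    S_hat = np.fft.rfft(signal)                 # \hat{S}(omega)
    u_hat = np.zeros((d, freqs.size), dtype=complex)
    lam = np.zeros(freqs.size, dtype=complex)   # \hat{lambda}(omega)
    omega = np.linspace(0.05, 0.45, d)          # assumed initial center frequencies

    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(d):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Eq. (5): Wiener-filter-like update of mode k
            u_hat[k] = (S_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Eq. (6): center frequency = power-weighted mean frequency
            power = np.abs(u_hat[k]) ** 2
            omega[k] = (freqs @ power) / (power.sum() + 1e-16)
        # Eq. (7): dual ascent on the reconstruction constraint
        lam = lam + tau * (S_hat - u_hat.sum(axis=0))
        # Eq. (8): relative-change stopping criterion
        change = sum(np.sum(np.abs(u_hat[k] - u_prev[k]) ** 2) /
                     (np.sum(np.abs(u_prev[k]) ** 2) + 1e-16) for k in range(d))
        if change < tol:
            break

    # Return the d IMFs in the time domain, one per row
    return np.array([np.fft.irfft(u_hat[k], n=T) for k in range(d)]), omega
```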

Fig. 5 IMFs of signals from scenario S1 (healthy) in case study 1

Fig. 6 IMFs of signals from scenario S9 in case study 1

Fig. 7 IMFs of signals from scenario S7 in case study 2

Fig. 8 IMFs of signals from scenario S9 in case study 2

4.2 Feature extraction

4.2.1 GARCH modeling of IMFs

Generally speaking, a signal can be modeled via an ARMA time series to evaluate its conditional mean. As an illustration, the ARMA(p, q) prediction of the conditional mean is formulated as [37]

$$S_{t} = \sum\limits_{i = 1}^{p} {\varphi_{i} S_{t - i} + } \sum\limits_{j = 1}^{q} {\theta_{j} \varepsilon_{t - j} } + \varepsilon_{t} + c,$$
(9)

where p denotes the autoregressive model order, \(\varphi_{i}\) the autoregressive coefficients, q the moving average model order, \(\theta_{j}\) the moving average coefficients, \(\varepsilon_{t}\) the residual, and c a constant. The residual is usually assumed to have zero mean and constant variance; in some time series, however, it is heteroscedastic and does not have constant variance [37]. In this case, the time-varying variance is called the conditional variance, described as

$$\sigma_{t}^{2} = {\text{var}}_{t - 1} (\varepsilon_{t} ) = E_{t - 1} \left( {\varepsilon_{t}^{2} } \right).$$
(10)

The GARCH model, established by Bollerslev [38], is a dynamic model that addresses the conditional heteroscedasticity, or volatility clustering, of an innovation process using a weighted combination of past conditional variances coupled with past squared residuals. This reduces the number of parameters and the complexity of the model. A \({\text{GARCH}}(r,m)\) model for the conditional variance of the residual \(\varepsilon_{t}\) is formed as

$$\sigma_{t}^{2} = \beta + \sum\limits_{i = 1}^{r} {b_{i} \sigma_{t - i}^{2} } + \sum\limits_{j = 1}^{m} {a_{j} \varepsilon_{t - j}^{2} } ,$$
(11)

where \(\beta\), \(b_{i}\), and \(a_{j}\) are the parameters of the GARCH model. The following constraints ensure that the conditional variance is positive:

$$\beta > 0,b_{i} \ge 0,a_{j} \ge 0.$$
(12)

Moreover, the following condition ensures covariance stationarity:

$$\sum\limits_{i = 1}^{r} {b_{i} } + \sum\limits_{j = 1}^{m} {a_{j} } < 1.$$
(13)

This paper utilizes the GARCH model to create the conditional variance model for the IMFs obtained from VMD; the GARCH model has shown reliable performance in nonlinear problems, as discussed in [19]. The coefficients of \({\text{GARCH}}(r,m)\), i.e., \(\{ b_{i} \}\) and \(\{ a_{j} \}\), are considered as features. Hence, the kth IMF is described by \(\left\{ {b_{1}^{(k)} , \ldots ,b_{r}^{(k)} ,a_{1}^{(k)} , \ldots ,a_{m}^{(k)} } \right\}\). Considering \(d\) IMFs, the feature vector of a single sensor's signal, \({\mathbf{r}}\), with \(d(r + m)\) entries is constructed as

$${\mathbf{r}} = \left[ {b_{1}^{(1)} , \ldots ,b_{r}^{(1)} ,a_{1}^{(1)} , \ldots ,a_{m}^{(1)} , \ldots ,b_{1}^{(d)} , \ldots ,b_{r}^{(d)} ,a_{1}^{(d)} , \ldots ,a_{m}^{(d)} } \right]^{{\text{T}}} .$$
(14)

Finally, since each signal is recorded by several sensors, each record is described with \(n_{f} = \sum\nolimits_{i = 1}^{n} {d_{i} (r + m)}\) features, where \(n\) is the number of sensors and \(d_{i}\) is the number of IMFs used to decompose the signal of the ith sensor. Hence, the feature vector of a record with \(n\) sensors is given as \({\mathbf{f}} = \left[ {{\mathbf{r}}_{1}^{{\text{T}}} , \ldots ,{\mathbf{r}}_{n}^{{\text{T}}} } \right]^{{\text{T}}}\). Not all obtained features are suitable for classification, and feature vectors may contain redundant features; hence, feature reduction techniques should be utilized to remove such features from the feature vector.
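As a sketch of this feature construction, the snippet below fits a GARCH(r, m) model to each IMF and stacks the conditional-variance coefficients into the vector r of Eq. (14). The arch package is an assumed implementation choice (the paper does not name one); note that arch's p argument counts the squared-residual (ARCH) lags \(a_{j}\) while q counts the variance (GARCH) lags \(b_{i}\).

```python
import numpy as np
from arch import arch_model  # assumed implementation choice

def garch_features(imfs, r=1, m=1):
    """Stack the GARCH(r, m) coefficients {b_i} and {a_j} of each IMF, Eq. (14)."""
    feats = []
    for imf in imfs:
        y = 100 * imf / (np.std(imf) + 1e-12)  # rescale for numerical stability
        res = arch_model(y, mean="Zero", vol="GARCH", p=m, q=r).fit(disp="off")
        feats += [res.params[f"beta[{i}]"] for i in range(1, r + 1)]   # b_i
        feats += [res.params[f"alpha[{j}]"] for j in range(1, m + 1)]  # a_j
    return np.asarray(feats)  # length d * (r + m) for d IMFs
```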

4.2.2 Feature reduction

The general concept of kernel-based feature reduction is to deploy a particular nonlinear mapping function to project the initial vector f into a high-dimensional feature space F. In the new feature space, the principal components are obtained through regular principal component analysis (PCA); in other words, the principal components in the feature space F correspond to principal nonlinear components in the initial space. Kernel functions, including the polynomial, radial basis function, and sigmoid kernels, are used to perform the nonlinear mapping in KPCA [39].

Assume a nonlinear mapping \(\phi\); the initial data space \({\mathbb{R}}^{{n_{f} }}\) is mapped into a new feature space \({\rm H}\) as [40]

$$\phi :{\mathbb{R}}^{{n_{f} }} \to {\text{H}},\quad {\mathbf{f}} \mapsto \phi ({\mathbf{f}}).$$
(15)

For a training sample set \({\mathbf{f}}_{1} ,{\mathbf{f}}_{2} ,...,{\mathbf{f}}_{M}\) in \({\mathbb{R}}^{{n_{f} }}\), where \(M\) denotes the number of training samples, the covariance matrix is formulated as [40]

$${\mathbf{S}}_{{}}^{\phi } = \frac{1}{M}\sum\nolimits_{j = 1}^{M} {\left( {\phi ({\mathbf{f}}_{j} ) - {\mathbf{m}}_{0}^{\phi } } \right)\left( {\phi ({\mathbf{f}}_{j} ) - {\mathbf{m}}_{0}^{\phi } } \right)^{{\text{T}}} } ,$$
(16)

such that

$${\mathbf{m}}_{0}^{\phi } = \frac{1}{M}\sum\nolimits_{j = 1}^{M} {\phi ({\mathbf{f}}_{j} )} .$$
(17)

Since \({\mathbf{S}}_{{}}^{\phi }\) is a bounded, compact, positive, and symmetric matrix, its nonzero eigenvalues are positive. To find these nonzero eigenvalues, Schölkopf et al. [41] suggested expressing every eigenvector of \({\mathbf{S}}_{{}}^{\phi }\) linearly as [40]

$$\beta = \sum\nolimits_{i = 1}^{M} {\alpha_{i} \phi ({\mathbf{f}}_{i} )} .$$
(18)

To compute the expansion coefficients, the Gram matrix is formed as \(\tilde{R} = {\mathbf{Q}}^{{\text{T}}} {\mathbf{Q}}\), where \({\mathbf{Q}} = [\phi ({\mathbf{f}}_{1} ),...,\phi ({\mathbf{f}}_{M} )]\). Consequently, each entry of \({\tilde{\mathbf{R}}}\) is computed using the kernel trick as [40]

$$\tilde{R}_{ij} = \phi \left( {{\mathbf{f}}_{i} } \right)^{{\text{T}}} \phi \left( {{\mathbf{f}}_{j} } \right) = \left( {\phi ({\mathbf{f}}_{i} ).\phi ({\mathbf{f}}_{j} )} \right) = K\left( {{\mathbf{f}}_{i} ,{\mathbf{f}}_{j} } \right).$$
(19)

Accordingly, \({\tilde{\mathbf{R}}}\) is centralized by [40]

$${\mathbf{R}} = {\tilde{\mathbf{R}}} - {\mathbf{1}}_{M} {\tilde{\mathbf{R}}} - {\tilde{\mathbf{R}}\mathbf{1}}_{M} + {\mathbf{1}}_{M} {\tilde{\mathbf{R}}\mathbf{1}}_{M} ,$$
(20)

where

$${\mathbf{1}}_{M} = \left( \frac{1}{M} \right)_{M \times M} .$$
(21)

Afterward, the orthonormal eigenvectors \(\gamma_{1} ,\cdots,\gamma_{{n_{{\text{p}}} }}\) of R corresponding to the \(n_{{\text{p}}}\) largest positive eigenvalues, \(\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{{n_{{\text{p}}} }}\), are calculated. Consequently, the orthonormal eigenvectors \(\beta_{1} ,\beta_{2} ,...,\beta_{{n_{p} }}\) of the corresponding \({\mathbf{S}}_{{}}^{\phi }\) are obtained via [40]

$$\beta_{j} = \frac{1}{{\sqrt {\lambda_{j} } }}Q\gamma_{j} \,;\quad j = 1,...,n_{{\text{p}}} .$$
(22)

After that, the KPCA-transformed feature vector \({\mathbf{y}} = \left( {y_{1} ,...,y_{{n_{{\text{p}}} }} } \right)^{{\text{T}}}\) is obtained by projecting the mapped sample \(\phi ({\mathbf{f}})\) onto the eigenvectors \(\beta_{1} ,\beta_{2} ,...,\beta_{{n_{{\text{p}}} }}\) as formulated below [40]

$${\mathbf{y}} = \left( {\beta_{1} ,\beta_{2} ,...,\beta_{{n_{p} }} } \right)^{{\text{T}}} \phi ({\mathbf{f}}).$$
(23)

The training matrix \({\mathbf{F}} = \left[ {{\mathbf{f}}_{1} ,\,{\mathbf{f}}_{2} , \ldots ,{\mathbf{f}}_{M} } \right]\) of size \(n_{{\text{f}}} \times M\) is thus mapped to the matrix \({\mathbf{Y}} = \left[ {{\mathbf{y}}_{1} ,\,{\mathbf{y}}_{2} , \ldots ,{\mathbf{y}}_{M} } \right]\) of size \(n_{{\text{p}}} \times M.\)
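In practice, this projection can be sketched with an off-the-shelf kernel PCA, as below. The RBF kernel is one of the kernels named above, while the sample sizes, gamma, and number of components are illustrative assumptions (scikit-learn arranges samples as rows rather than columns).

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical stack of GARCH feature vectors: one row per record
F = np.random.randn(450, 48)

kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1.0 / F.shape[1])
Y = kpca.fit_transform(F)   # rows are the transformed vectors y of Eq. (23)
```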

The aim of linear discriminant analysis (LDA) is as follows [42]:

$${\mathbf{a}}_{{{\text{opt}}}} = \arg \max \frac{{{\mathbf{a}}^{{\text{T}}} {\mathbf{S}}_{{\text{b}}} {\mathbf{a}}}}{{{\mathbf{a}}^{{\text{T}}} {\mathbf{S}}_{{\text{w}}} {\mathbf{a}}}},$$
(24)

where \({\mathbf{S}}_{{\text{b}}}\) and \({\mathbf{S}}_{{\text{w}}}\) reveal the between-class and within-class scatter matrices, which are obtained as

$${\mathbf{S}}_{{\text{b}}} = \sum\nolimits_{k = 1}^{{n_{{\text{c}}} }} {m_{k} \left( {{\varvec{\mu}}^{(k)} - {\varvec{\mu}}} \right)\left( {{\varvec{\mu}}^{(k)} - {\varvec{\mu}}} \right)^{{\text{T}}} } ,$$
(25)
$${\mathbf{S}}_{{\text{w}}} = \sum\nolimits_{k = 1}^{{n_{{\text{c}}} }} {\left( {\sum\nolimits_{i = 1}^{{m_{k} }} {\left( {{\mathbf{y}}_{i}^{(k)} - {\varvec{\mu}}^{(k)} } \right)\left( {{\mathbf{y}}_{i}^{(k)} - {\varvec{\mu}}^{(k)} } \right)^{{\text{T}}} } } \right)} ,$$
(26)

where \({\varvec{\mu}}\) is the global mean, \(m_{k}\) is the number of samples in the kth class, and \({\varvec{\mu}}^{(k)}\) denotes the mean of the kth class. The total scatter matrix is then defined as \({\mathbf{S}}_{t} = {\mathbf{S}}_{{\text{b}}} + {\mathbf{S}}_{{\text{w}}}\). The optimum values of a correspond to the nonzero eigenvalues of the eigenproblem

$${\mathbf{S}}_{{\text{b}}} {\mathbf{a}} = \lambda {\mathbf{S}}_{t} {\mathbf{a}}.$$
(27)

A maximum of \(n_{{\text{c}}} - 1\) eigenvectors corresponding to nonzero eigenvalues are obtained, because the rank of \({\mathbf{S}}_{{\text{b}}}\) is limited to \(n_{{\text{c}}} - 1\). A mapping similar to (15) is considered to extend LDA to the nonlinear case (KDA). Hence, \({\mathbf{S}}_{{\text{b}}}^{\varphi }\), \({\mathbf{S}}_{{\text{w}}}^{\varphi }\), and \({\mathbf{S}}_{t}^{\varphi }\), respectively, stand for the between-class, within-class, and total scatter matrices in the feature space, which are obtained by the following formulations:

$${\mathbf{S}}_{{\text{b}}}^{\varphi } = \sum\nolimits_{k = 1}^{{n_{{\text{c}}} }} {m_{k} \left( {{\varvec{\mu}}_{\varphi }^{(k)} - {\varvec{\mu}}_{\varphi } } \right)\left( {{\varvec{\mu}}_{\varphi }^{(k)} - {\varvec{\mu}}_{\varphi } } \right)^{{\text{T}}} } ,$$
(28)
$${\mathbf{S}}_{w}^{\varphi } = \sum\nolimits_{k = 1}^{{n_{c} }} {\left( {\sum\nolimits_{i = 1}^{{m_{k} }} {\left( {\varphi \left( {{\mathbf{y}}_{i}^{(k)} } \right) - {\varvec{\mu}}_{\varphi }^{(k)} } \right)\left( {\varphi \left( {{\mathbf{y}}_{i}^{(k)} } \right) - {\varvec{\mu}}_{\varphi }^{(k)} } \right)^{{\text{T}}} } } \right)}$$
(29)
$${\mathbf{S}}_{t}^{\varphi } = \sum\nolimits_{i = 1}^{M} {\left( {\varphi \left( {{\mathbf{y}}_{i}^{{}} } \right) - {\varvec{\mu}}_{\varphi }^{{}} } \right)\left( {\varphi \left( {{\mathbf{y}}_{i}^{{}} } \right) - {\varvec{\mu}}_{\varphi }^{{}} } \right)^{{\text{T}}} } .$$
(30)

Assume that \({\varvec{\nu}}\) denotes the projective function in the feature space; the associated objective function is defined as

$${\varvec{\nu}}_{{{\text{opt}}}} = \arg \max \frac{{{\varvec{\nu}}^{{\text{T}}} {\mathbf{S}}_{{\text{b}}}^{\varphi } {\varvec{\nu}}}}{{{\varvec{\nu}}^{{\text{T}}} {\mathbf{S}}_{t}^{\varphi } {\varvec{\nu}}}}.$$
(31)

This function can be maximized by solving the eigenproblem

$${\mathbf{S}}_{{\text{b}}}^{\varphi } {\varvec{\nu}} = \lambda {\mathbf{S}}_{t}^{\varphi } {\varvec{\nu}}.$$
(32)

where the projective function is expanded as

$${\varvec{\nu }} = \sum\nolimits_{{i = 1}}^{M} {\alpha _{i} \varphi \left( {{\mathbf{y}}_{i} } \right)} .$$
(33)

Then, we can define an equivalent problem as:

$$\varvec{\alpha }_{{{\text{opt}}}} = \arg \max \frac{{\varvec{\alpha }^{{\text{T}}} {\mathbf{KWK}}\varvec{\alpha }}}{{\varvec{\alpha }^{{\text{T}}} {\mathbf{KK}}\varvec{\alpha }}},$$
(34)

where \({\varvec{\alpha}} = [\alpha_{1} ,...,\alpha_{M} ]^{{\text{T}}}\). The corresponding eigenproblem is as \({\mathbf{KWK}}\varvec{\alpha} = \lambda {\mathbf{KK}}{\varvec{\alpha}}\), where K shows the kernel matrix, i.e., \(K_{ij} = \kappa ({\mathbf{y}}_{i} ,{\mathbf{y}}_{j} )\) and W is defined as

$$W_{ij} = \left\{ {\begin{array}{*{20}c} {{1 \mathord{\left/ {\vphantom {1 {m_{k} }}} \right. \kern-\nulldelimiterspace} {m_{k} }},} & {{\text{if}}\,{\mathbf{y}}_{i} \,\,{\text{and}}\,\,{\mathbf{y}}_{j} \,\,{\text{both}}\,\,{\text{belongs}}\,\,{\text{to}}\,\,{\text{the}}\,\,k{\text{th}}\,\,{\text{class}}} \\ {0,} & {{\text{otherwise}}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,} \\ \end{array} } \right..$$
(35)

Each eigenvector \({\varvec{\alpha}}\) provides a projective function \({\varvec{\nu}}\) in the feature space. Let y be a data sample; then, we have

$$\left\langle {{\varvec{\nu}},\varphi ({\varvec{y}})} \right\rangle = \sum\nolimits_{i = 1}^{m} {\alpha_{i} \left\langle {\varphi ({\varvec{y}}_{i} ),\varphi ({\varvec{y}})} \right\rangle } = \sum\nolimits_{i = 1}^{m} {\alpha_{i} \kappa \left( {{\varvec{y}}_{i} ,{\varvec{y}}} \right)} = {\varvec{\alpha}}^{{\text{T}}} \kappa \left( {:,{\varvec{y}}} \right),$$
(36)

where \(\kappa \left( {:,{\varvec{y}}} \right) \doteq \left[ {\kappa \left( {{\varvec{y}}_{1} ,{\varvec{y}}} \right), \ldots ,\kappa \left( {{\varvec{y}}_{m} ,{\varvec{y}}} \right)} \right]^{{\text{T}}}\). Let \(\left\{ {{\varvec{\alpha}}_{1} , \ldots ,{\varvec{\alpha}}_{{n_{{\text{c}}} - 1}} } \right\}\) be the \(n_{{\text{c}}} - 1\) eigenvectors of the eigenproblem associated with nonzero eigenvalues. The transformation matrix \(\Theta = \left[ {{\varvec{\alpha}}_{1} , \ldots ,{\varvec{\alpha}}_{{n_{{\text{c}}} - 1}} } \right]\) is an \(M \times (n_{{\text{c}}} - 1)\) matrix that embeds a data sample y into the \((n_{{\text{c}}} - 1)\)-dimensional subspace by

$${\mathbf{y}} \to {\mathbf{z}} = \Theta^{{\text{T}}} \kappa (:,{\mathbf{y}}).$$
(37)

4.3 Classification

Three classifiers are applied to the previously selected features, which serve as predictors. These classifiers are prevalent in the realm of machine learning: the support vector machine (SVM), fine tree, and k-nearest neighbors (kNN). SVM is a supervised training algorithm founded on treating measurements as points in a feature space; samples can be separated by a line in a two-dimensional problem and by a hyperplane in higher-dimensional ones [43]. Regarding kNN, despite its simplicity, it is commonly used on large training datasets; it assigns an estimated value to a new sample based on a plurality or weighted vote of the k nearest neighbors in the training set [44]. Classification using a decision tree (fine tree) algorithm is very fast and suitable for high-dimensional classification problems. A fine tree is a predictive algorithm mapping observations about an item to conclusions about its target value; leaves represent the labels, nodes are the features, and branches denote the junctions of features leading to a label [45]. The predictions of these classifiers are compared with each other in the following sections.
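A sketch of this comparison using scikit-learn equivalents is shown below; "fine tree" is MATLAB terminology approximated here by a deep decision tree, and the data shapes mirror the nonlinear case study (17 states, ten records each) purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

Z = np.random.randn(170, 16)              # reduced features (hypothetical)
labels = np.repeat(np.arange(17), 10)     # 17 states x 10 records each

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "fine tree": DecisionTreeClassifier(),   # deep tree ~ MATLAB "fine tree"
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, Z, labels, cv=5)   # fivefold CV as in Sect. 5
    print(f"{name}: {100 * acc.mean():.2f}%")
```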

5 Results and discussion

This section provides the experimental results and relevant discussions. We considered fivefold cross-validation to assess the performance of the proposed method. To this end, the data were randomly partitioned into five equal-sized groups, and the training and testing procedures were repeated for five trials. In each trial, one group was used for testing, and the other groups were used to train the classifier. Finally, the results were averaged.

5.1 The effect of the number of IMFs on residual

The number of IMFs has a considerable effect on the number of extracted features and the complexity of the proposed method. Here, we determine the efficient number of IMFs based on the mean absolute residual, shown in Fig. 9 for different numbers of IMFs of the nonlinear signals. It is observed that the residual generally decreases as the number of IMFs increases; however, the slope of the reduction varies between sensors. The residuals of sensors 2, 3, and 4 decrease faster than that of sensor 1. As observed, the residual of sensor 1 does not vary significantly when the number of IMFs is greater than ten, whereas the reduction in the residuals of sensors 2, 3, and 4 is not notable beyond seven IMFs. Hence, we consider ten IMFs for sensor 1 and seven IMFs for the remaining sensors. Considering 31 IMFs in total and two features extracted from each IMF, each record is described with 62 features.

Fig. 9 The residual of VMD of nonlinear data for different numbers of IMFs and data lengths: a length of 512, b length of 2048, and c length of 8192

For the linear case, as observed in Fig. 10, the residuals of all sensors dwindle gradually at nearly the same pace, and beyond eight IMFs, the residual does not show significant deviations. Thus, for the linear signals, eight IMFs are assigned to the sensors of all stories. Considering two features for each IMF, each record is described by 48 features.

Fig. 10 The residual of VMD of linear data for different numbers of IMFs and data lengths: a length of 512, b length of 2048, and c length of 8192

5.2 Classification accuracy

To assess the stability of the proposed method and evaluate the effect of features on results, the authors considered four cases as follows:

  • SA: no feature reduction method is employed

  • SB: only KPCA is used for feature reduction

  • SC: only KDA is employed for feature reduction

  • SD: at first, KPCA and then KDA is considered for feature reduction.

The number of features in conditions SB, SC, and SD is obtained based on the normalized cumulative summation of eigenvalues (NCSE): the efficient number of features is the first index at which the NCSE exceeds 0.95. Considering \(\left[ {\lambda_{1} , \cdots ,\lambda_{{n_{{\text{f}}} }} } \right]\) as the eigenvalues sorted in descending order, the NCSE is calculated as follows:

$$\Lambda_{i} = \frac{{\sum\nolimits_{k = 1}^{i} {\lambda_{k} } }}{{\sum\nolimits_{k = 1}^{{n_{{\text{f}}} }} {\lambda_{k} } }};\quad i = 1, \cdots ,n_{{\text{f}}} .$$
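A direct implementation of this selection rule, as a small sketch:

```python
import numpy as np

def n_features_by_ncse(eigvals, threshold=0.95):
    """Return the smallest i whose NCSE (Lambda_i above) exceeds threshold."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    ncse = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ncse, threshold) + 1)       # 1-based count
```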

The classification accuracies of the proposed method for nonlinear and linear data, considering the kNN, SVM, and fine tree classifiers and different lengths of the sensor signals, are given in Tables 3 and 4, respectively.

Table 3 Classification accuracy of the proposed method for different classifiers and lengths of nonlinear data
Table 4 Classification accuracy of the proposed method for different classifiers and lengths of linear data

Concerning the nonlinear case, the minimum and maximum performances are observed in scenarios SA and SD, with 76.92% and 98.82%, respectively. In all scenarios, the fine tree classifier is more efficient than the other classifiers; kNN is the second most accurate classifier, and SVM shows the lowest performance in this case. It is noteworthy that the signal length has the highest impact on SB and the lowest on SD, with relative variations (\(\Delta_{\max }\)) of 9.09% and 3.69%, respectively.

Regarding the linear case study, the lowest and highest performances were, as in the nonlinear case, observed in SA and SD, with accuracies of 89.56% and 100.0%, respectively. Similar to the previous case, the fine tree is the most suitable classifier in all proposed scenarios, and except for SB, kNN shows higher performance than SVM. Scenario SB reveals the least sensitivity to the signal length, whereas scenario SA shows the highest sensitivity to signal variations based on \(\Delta_{\max }\).

5.3 Confusion matrix

In this part, the classification performance for both case studies is provided through confusion matrices. Considering the confusion matrix, we provide the recall or sensitivity (Sens.), precision (Prec.), total accuracy (Acc.), and F-score, which are defined as

$${\text{Sens}}. = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}},$$
(38)
$${\text{Prec}}. = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}},$$
(39)
$${\text{Acc}}. = 100\frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}},$$
(40)
$$F{\text{{-}score}} = 2\frac{{{\text{Prec}}. \times {\text{Sens}}.}}{{{\text{Prec}}. + {\text{Sens}}.}} = \frac{{{\text{TP}}}}{{{\text{TP}} + 0.5({\text{FP}} + {\text{FN}})}},$$
(41)

where TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative, respectively.
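These quantities can be recovered from a multiclass confusion matrix as sketched below; the labels are hypothetical, the per-class precision, recall, and F-score follow Eqs. (38), (39), and (41), and the trace ratio gives the total accuracy of Eq. (40) in percent.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 2, 2, 2])     # hypothetical scenario labels
y_pred = np.array([0, 0, 1, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)        # rows: true class, cols: predicted
prec, sens, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)
acc = 100 * np.trace(cm) / cm.sum()          # total accuracy, Eq. (40)
```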

The results for the linear case study are given in Table 5, with the performance metrics computed for the nine scenarios described earlier. As indicated, the proposed method identifies all damage scenarios without error; consequently, this approach achieves the highest performance for discriminating linear damage among the configurations considered in this study.

Table 5 Confusion matrix for linear case study

Regarding the nonlinear case study, 17 separate states of the specimen are predicted through the presented technique, and the results are presented in the confusion matrix depicted in Table 6. As noted, in the majority of the damage states, the prediction accuracy is 100%; for the remaining cases, two out of seventeen scenarios, the classification performance is 90.0%. Overall, the established strategy reveals considerable performance in recognizing both nonlinear and linear damage with significant precision.

Table 6 Confusion matrix for nonlinear case study

5.4 The effect of noise

Noise of various intensities, quantified by the signal-to-noise ratio (SNR), is applied to the responses to assess the stability of the proposed method, as depicted in Fig. 11. As observed, the proposed method remains efficient even in environments contaminated with severe noise (SNR = 1). Furthermore, the established approach maintains its performance against noise, showing insignificant variations for SNRs of 20 and 15.
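For reproducing such a study, noise at a prescribed SNR can be injected as in the sketch below; interpreting the SNR values in decibels is our assumption, as the paper does not state the convention.

```python
import numpy as np

def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the result has the requested SNR (dB)."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
```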

Fig. 11 Effect of noise on classification performance

6 GARCH effect assessment

In this section, two tests are applied to demonstrate the suitability of the GARCH model [46]: the kurtosis test and the ARCH test, provided in the following subsections.

6.1 Kurtosis test

The GARCH model is appropriate for signals with heavy-tailed distributions; therefore, the kurtosis test is utilized to determine whether the signals have heavy tails. The kurtosis of a distribution s is formulated as follows [46]:

$$K(s) = \frac{{E(s - \mu )^{4} }}{{\sigma^{4} }},$$
(42)

where \(\mu\) and \(\sigma\) denote the mean and standard deviation of the distribution s, respectively, and \(E(\cdot)\) stands for the expected value. For a Gaussian distribution, the kurtosis equals three; higher values show that the distribution has a heavier tail than the Gaussian. This paper applies this test to the IMFs of each sensor, and the average results for the minimum and maximum values over the sub-bands are presented in Table 7. As the results show, the maximum values are higher than 3, which indicates that the IMFs do not follow a Gaussian distribution.
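This check can be sketched in one call; note that SciPy's default is the Fisher (excess) kurtosis, so fisher=False is needed to obtain the Pearson definition of Eq. (42), for which the Gaussian value is 3.

```python
import numpy as np
from scipy.stats import kurtosis

imf = np.random.standard_t(df=4, size=8192)  # hypothetical heavy-tailed IMF
k = kurtosis(imf, fisher=False)              # Eq. (42): Gaussian -> 3
heavy_tailed = k > 3
```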

Table 7 Kurtosis test for IMFs

6.2 ARCH test

Based on the hypothesis test provided in [47], the ARCH test is deployed to examine the existence of ARCH/GARCH effects in the IMFs of each sensor. In this reference, a Lagrange multiplier test based on regression is presented, whose test statistic is asymptotically Chi-square distributed with q degrees of freedom [46].

Thus, in this part, the ARCH test is applied to the IMFs of the different sub-bands, and the average results over the signals are shown in Table 8. In this table, h stands for the Boolean decision variable, where 1 indicates rejection of the null hypothesis that no GARCH effect exists. The p value is the significance level at which the test rejects the null hypothesis, while GARCHstat and CriticalValue are the ARCH test statistic and the critical value of the Chi-square distribution, respectively. Based on this test, if GARCHstat is less than the critical value, no GARCH effect exists. In this study, the significance level is set to 0.05, as frequently deployed [48]. Notably, these results are averaged over all signals; for example, the average value of h for the fourth IMF of the first sensor is 0.74, which means that 74% of those signals exhibit the GARCH effect. Thus, in general, the results of the table prove the existence of the GARCH effect in most cases.
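As a sketch, Engle's Lagrange multiplier test is available in statsmodels; the het_arch function and the significance level below are assumed choices matching the test described above, and the nlags argument follows recent statsmodels versions.

```python
import numpy as np
from statsmodels.stats.diagnostic import het_arch

def arch_effect(imf, q=1, alpha=0.05):
    """Engle's LM test with q lags; h = 1 rejects the null of no ARCH effect."""
    lm_stat, p_value, _, _ = het_arch(np.asarray(imf), nlags=q)
    return {"h": int(p_value < alpha), "pValue": p_value, "GARCHstat": lm_stat}
```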

Table 8 ARCH test for IMFs

7 Conclusion

In this paper, a novel methodology was proposed to identify and classify linear and nonlinear damage in building structures. Here, VMD was applied to decompose the input signals, and the GARCH model was used for modeling the decomposed sub-signals, whose coefficients were deployed as the features of the input signals. Because retaining all extracted features is neither optimal nor efficient, KPCA and KDA were applied to the extracted features, respectively, to find the optimum and appropriate features. It was observed that kernel-based dimensionality reduction could enhance the classification performance of the SVM, kNN, and fine tree algorithms. It was demonstrated on two empirical models that the proposed method could discriminate linear damage states correctly and without any error and classify nonlinear damage with significant accuracy. Moreover, the proposed method proved its efficiency even in noisy environments, maintaining its performance for SNRs of 20 and 15. Finally, kurtosis and ARCH tests were deployed to verify the existence of the GARCH effect; the results showed that the IMFs exhibit the GARCH effect and are thereby appropriate candidates for the proposed method.

The authors suggest applying VMD and the GARCH model within unsupervised approaches and reinforcement learning. Moreover, optimization algorithms such as particle swarm optimization (PSO) and the grey wolf optimizer (GWO) could be deployed to find the optimum number of features. The current limitation of the proposed method is its sensitivity to noisy signals, which can be addressed by SNR estimation and noise reduction through signal processing approaches. Semi-supervised schemes could also be considered to reduce the effect of noisy features on the performance of the feature reduction schemes.