1 Introduction

Steel-framed structures such as buildings, bridges, dams, and electrical transmission towers are now built widely because of their economic benefits, ease of construction, and contribution to national well-being. These structures consist of structural members connected by rivets, bolts, and welded joints. During their service life, they are prone to connection damage due to human error, corrosion, fatigue, environmental variability, and unpredictable events such as earthquakes and underground mining. If damage is not identified at an early stage, it can spread through the structure and lead to sudden failure, causing loss of life and property. In this regard, structural health monitoring (SHM) offers a practical means of preventing such sudden failures by continuously or regularly monitoring structural integrity.

SHM techniques are classified into local and global (vibration-based) techniques. In the past, local techniques such as acoustoelastic effect-based methods (Wang and Song 2019), ultrasonic techniques (Mutlib et al. 2016), vision-based methods (Fukuda et al. 2010), piezoelectric impedance methods, modal strain energy methods (Pal and Banerjee 2015), and displacement-based methods (Park et al. 2015) have been utilized in SHM to identify joint damage in frame structures. However, local techniques are unsuitable for most structures because they are expensive and inefficient; hence, researchers moved to vibration-based (VB) techniques. VB techniques assess the overall performance of the monitored structure by converting its vibration response into a meaningful damage identification parameter that indicates the actual condition of the structure, which has made them popular. The ultimate aim of these techniques is to detect damage by processing data acquired from sensors such as accelerometers and strain gauges (Bandara et al. 2014).

In recent years, machine learning (ML) models have been extensively utilized in vibration-based health monitoring techniques. SHM using ML models addresses a pattern identification problem (Fallahian et al. 2022). However, the effectiveness of ML-based models depends on the selected ML algorithm, the training samples, and the number of training samples. Among ML models, deep learning (DL) models have become the most popular owing to their impressive performance in many scientific areas (Lecun et al. 2015). DL models use multiple learning layers to find out how input and output datasets are related. Convolutional neural networks (CNNs) are DL techniques inspired by how the human brain works. CNNs are a powerful tool for extracting and classifying features and are mostly used to recognize data such as pictures and videos (Konstantinidis et al. 2020). The overview and working procedure of the CNN are explained in Sect. 2.

A study (Yun et al. 2001) proposed a method for the identification of joint damage in multi-story plane frames using modal parameters and an ANN algorithm. In that work, the damage was simulated by a rotational spring at the end of a beam component, and the damage was quantified by the reduction ratio of the joint fixity factor. It was found that the damage could be accurately estimated even when the modal data were heavily contaminated with noise. Similarly, studies (Huang et al. 2017; Ng 2014) detected brace damage and joint damage in an ASCE benchmark building using the Bayesian framework. Lei et al. (2014) proposed a model based on a two-step Kalman filter methodology for the identification of joint damage in a frame under earthquake excitation. A damage identification technique based on SVM and principal component analysis was presented by Bolourani et al. (2021). A study (Chen and Zang 2009) presented an ML technique for member damage detection in the ASCE benchmark building using the Artificial Immune Pattern Recognition classifier. The results show that the technique provides better accuracy than the Naïve Bayes and KNN classifiers, while it underperforms the SVM. A study (Salkhordeh et al. 2021) presented a decision-tree-based classification model for detecting the intensities of member damage in a braced steel frame structure. In that work, the features, namely drift, correlation, and energy ratio, were extracted from the raw acceleration data and classified into damage levels. In addition, Gui et al. (2017) presented an SHM method for a 3D steel frame structure using autoregressive (AR) and residual error feature-based SVM algorithms. In that work, the features were obtained from the acceleration time series data and then fed into the SVM algorithm to classify the undamaged and various damaged cases. Rosso et al. (2023) examined the effects of noise on SHM of a subspace structure using different machine learning algorithms. Mghazli et al. (2023) presented an optimization-based methodology for selecting the optimal sensor positions in SHM applications.

Similarly, scalogram image-based health monitoring techniques for the joints of steel frames were presented by Avci et al. (2020), Pal et al. (2022), Paral et al. (2020), and Sharma and Sen (2020). In these studies, the classification of undamaged and various damaged cases was achieved by the CNN algorithm. Additionally, crack damage in concrete structures was identified using the CNN algorithm (Cha and Choi 2017) and a region-based algorithm (Cha et al. 2018). A comparative study of the FFT and wavelet transform was presented by Epp and Cha (2017) to identify internal damage in a concrete structure. Ta et al. (2022) presented corroded bolt loosening identification in a steel girder using a mask region-based CNN algorithm.

One feasible option is to train the classifiers with real-world data from the undamaged structure in addition to simulated data for the damage cases. The simulated data can be obtained experimentally using a downscaled laboratory model of the monitored structure or mathematically with an accurate finite element (FE) model. As a result, damage data from an otherwise undamaged structure would no longer be necessary (Avci et al. 2021; Bigoni and Hesthaven 2020; Pimentel et al. 2014). A further approach to the problem is to use semi-supervised or unsupervised ML and DL algorithms, which can process partially labeled or unlabeled vibration data. There are few studies in the field of structural damage identification that use unsupervised techniques. A study (Wang and Cha 2021) suggested an unsupervised DL method that uses a deep auto-encoder and a one-class SVM, with only measured acceleration response data from baseline structures as training data, to detect future damage to structures. An unsupervised damage identification approach was presented by Cha and Wang (2018) to identify joint damage in a 3D frame structure using a density-peaks-based fast clustering algorithm. In that study, the crest factor and wavelet coefficients were extracted from acceleration data and used as input to the algorithm to classify the damage cases. A comparative study of unsupervised ML and DL algorithms was presented by Wang and Cha (2022) to identify loosened bolts in a steel bridge.

Changes in environmental and operational factors affect the vibration properties of structures. Changes in the modal parameters of steel buildings due to environmental effects have not been explored as much as those of bridge structures (Xia et al. 2012). In particular, temperature is an important factor affecting the modal parameters of a steel frame structure. Usually, a 5–10% variation in the natural frequencies is observed daily and seasonally for bridge-type structures (Cornwell et al. 1999; Peeters and Roeck 2015). Kim et al. (2007) performed an experimental study on a model of a steel bridge and noticed that when the temperature rises by 1 °C, the first four frequencies decline by 0.64%, 0.33%, 0.44%, and 0.22%, respectively. A few studies on bridge structures take temperature changes into account as a source of environmental variability (Song and Dyke 2006; Xu and Zhishen 2007). Nayeri et al. (2008) proposed a damage detection technique for a 17-story steel 3D frame structure and observed a significant correlation between the variations in frequency and the variations in temperature over a few hours. Yuen and Kuok (2010) employed the Bayesian spectral density technique to identify the modal frequencies of a 22-story building over a year. The authors observed that the first three frequencies increased with increasing room temperature, which was the reverse of their modeling analysis. Faravelli et al. (2011) noticed changes in the frequencies of a 600-m TV tower over a day: as the temperature changed by about 3 °C, the frequencies changed by about 0.5%.

From the literature, it is evident that more techniques have been studied for damage identification in plane frame structures than in 3D frame structures. Moreover, most studies are carried out without considering temperature variations. Therefore, CNN-based DL for structural health monitoring at the connections of multi-story 3D steel frame structures under temperature variations is yet to be addressed. Hence, the present study is motivated by the need to develop an SHM technique for the identification of damage at the connections of 3D steel frame structures using scalogram images of vibration data under temperature variability. The contributions of the present study are as follows:

  1. The development of a CNN-based SHM technique for the health monitoring of connections in a multi-story 3D framed structure.

  2. The application of scalogram images for the detection, localization, and severity estimation of connection damage in a 3D frame structure.

  3. The verification of the robustness of the technique for connection damage identification in a 3D frame structure under temperature variability.

2 CNN-based SHM technique

In the present work, an impact hammer is used to excite the structure, and the time-history acceleration responses are recorded for the undamaged and different damaged cases. The time-history acceleration responses are converted into time–frequency scalogram images using the continuous wavelet transform (CWT) in MATLAB. The convolutional neural network is then trained and tested with the scalogram image dataset to classify the undamaged and different damaged cases. In this way, the location and severity of the damage are identified under different temperature variations.

2.1 Wavelet analysis

The wavelet transform is one of the most popular tools in signal processing. It has been widely used to find discontinuities between two time-series signals (Yazdanpanah et al. 2020). Among the various wavelet transform methods, the continuous wavelet transform is employed to extract features that change over time, find similar time-varying patterns in different signals, and perform time-localized filtering. Sudden changes in a signal produce wavelet coefficients with larger absolute values. For a signal y(t) in the time domain, the continuous wavelet transform is obtained by integrating the product of the signal and the complex conjugate of the scaled and translated original (mother) wavelet function:

$${W}_{m, n}=\int^{+\infty}_{-\infty}y \left(t\right){\varnothing }_{m,n}^{*}\left(t\right){\text{d}}t$$
(1)

Here, \({\varnothing }_{m,n}\) is a real- or complex-valued function localized in both the time and frequency domains, * denotes complex conjugation, and \({\varnothing }_{m,n}\) is derived from the original wavelet \(\varnothing\) as

$${\varnothing }_{m,n}\left(t\right)=\frac{1}{\sqrt{m}}\varnothing \left(\frac{t-n}{m}\right)$$
(2)

Here, the real numbers m and n represent the scale and translation variables, respectively.

In the wavelet transform, the translation variable n specifies the position of the moving wavelet window, and the scale variable m indicates the width of the window. Because the wavelet transform works with a set of wavelets that are localized in both the time and frequency domains, the continuous wavelet transform of the signal gives the time–frequency description, or scalogram, of the raw acceleration data. A scalogram image represents the absolute value of the CWT coefficients of the data.
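As a rough illustration of the transform defined above, the following Python sketch evaluates the CWT by direct numerical integration and takes the absolute value of the coefficients to form a scalogram, with m as the scale and n as the translation. It is not the MATLAB routine used in this work: the Morlet mother wavelet, its centre frequency w0 = 6, the scale-to-frequency relation, the hop between translations, and the synthetic 20 Hz test signal are all assumptions made for the example.

```python
import cmath
import math


def morlet(x, w0=6.0):
    """Complex Morlet mother wavelet (assumed here; not the wavelet of the study)."""
    return (math.pi ** -0.25) * cmath.exp(1j * w0 * x) * math.exp(-0.5 * x * x)


def cwt_coefficient(y, dt, scale, shift, w0=6.0):
    """Riemann-sum approximation of the CWT integral for one (scale m, translation n)."""
    total = 0j
    for k, yk in enumerate(y):
        x = (k * dt - shift) / scale
        if abs(x) < 6.0:  # the Gaussian envelope is negligible beyond this width
            total += yk * morlet(x, w0).conjugate()
    return total * dt / math.sqrt(scale)


def scalogram(y, dt, freqs, hop=8, w0=6.0):
    """Return |W(m, n)| with one row per frequency and one column per translation."""
    rows = []
    for f in freqs:
        scale = w0 / (2.0 * math.pi * f)  # Morlet scale-frequency relation
        rows.append([abs(cwt_coefficient(y, dt, scale, n * dt, w0))
                     for n in range(0, len(y), hop)])
    return rows


# Synthetic example: a 20 Hz sine sampled at 500 Hz for one second.
fs = 500.0
signal = [math.sin(2.0 * math.pi * 20.0 * k / fs) for k in range(500)]
freqs = [5.0, 10.0, 20.0, 40.0, 80.0]
image = scalogram(signal, 1.0 / fs, freqs)

# The row with the largest energy should correspond to the signal frequency.
energy_per_row = [sum(v * v for v in row) for row in image]
dominant = freqs[energy_per_row.index(max(energy_per_row))]
print("dominant scalogram row (Hz):", dominant)
```

Each row of `image` is one horizontal band of the scalogram; a plotting library would map these magnitudes to colours to obtain an image comparable in spirit to those used as CNN input.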

2.2 Image data set generation

The following steps are performed to create the datasets that are used to train and test the CNN-based SHM technique:

  • Step 1: Generate the time–frequency domain scalogram images from the time history acceleration data by performing wavelet analysis.

  • Step 2: Reduce the sizes of the scalogram images.

  • Step 3: Add different levels of noise to the images to generate a large dataset.

The CNN-based deep learning technique for SHM of a 3D frame structure is developed to classify the different damage cases given in Table 2. In the present work, the scalogram images are generated by performing a continuous wavelet transform of the time-history acceleration responses collected under impact excitations. Analytic Morse wavelet functions are used as the original wavelets, a choice made after many trials. The RGB scalograms are stored as images and used as input to the CNN model. To train the CNN model, the data was collected under one undamaged and three different damaged cases. After training, the model is validated and then employed to predict the classes of the test batch images.

Past studies show that a CNN requires a large number of images for classification (Han et al. 2018; Sharma et al. 2018). However, in the health monitoring domain, the experimental procedure can be repeated only a limited number of times, which produces a limited number of images. In the present work, each experimental case is repeated for 20 trials. To generate a large dataset for training the model, image augmentation is applied by adding different levels of white Gaussian noise to the raw experimental scalogram images.

2.3 CNN model

A CNN is a type of deep neural network that is used for image recognition (Gopalakrishnan et al. 2017), and its operation is loosely inspired by the way the human brain recognizes images. The basic framework of the CNN model is shown in Fig. 1. The whole framework is divided into two primary phases: feature extraction and classification. To capture the significant aspects of the image, the kernel size and the number of kernel filters are taken as 8 × 8 pixels and 20, respectively, in the current study. The stride, i.e., the step by which the kernel moves in the horizontal and vertical directions, is taken as one pixel. The image is sent to the feature extraction layers, and the extracted features are then fed to the classification layers.

Fig. 1 The basic framework of the CNN model

The output is produced by the classification neural network, which works on the image's features. The feature extraction neural network consists of sets of convolution layers (CL) and pooling layers (PL). The input image is transformed by a CL so that features can be extracted from it; this is achieved by a kernel (or filter). A kernel is a small matrix whose height and width are smaller than those of the image being convolved. The PL combines adjacent pixels into a single pixel and consequently decreases the dimension of the image. As the main purpose of a convolutional neural network is to process images, the CL and PL operate in a 2D plane, which is one major difference between convolutional neural networks and other neural networks. Mean pooling and max pooling are the two types of pooling operations that may be performed. Prior research indicates that, for image processing, max pooling performs better than mean pooling. In the present study, the sliding window size of the max pooling layer is [2 × 2] and its stride is 2. Max pooling selects the maximum value within the sliding window. After the PL operation, the classification process starts in a fully connected layer, which applies a linear transformation to the input vector through a weight matrix (Gao and Mosalam 2018; Kim et al. 2021). In the present research, SoftMax is used as the last layer of the CNN model; it gives the probability that a particular image belongs to each class.
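The layer operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the network used in the study: the convolution is assumed to use no padding, the 224-pixel input side is taken from the resized scalograms reported later in the paper, and the small 4 × 4 matrix and the four class scores are made-up values.

```python
import math


def output_size(size, kernel, stride=1, padding=0):
    """Side length of a feature map after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_2d(image, window=2, stride=2):
    """Keep the maximum value of each window x window patch."""
    rows = range(0, len(image) - window + 1, stride)
    cols = range(0, len(image[0]) - window + 1, stride)
    return [[max(image[r + i][c + j]
                 for i in range(window) for j in range(window))
             for c in cols] for r in rows]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# 8 x 8 kernel with stride 1 (no padding assumed), then 2 x 2 max pooling with stride 2.
conv_side = output_size(224, kernel=8, stride=1)
pool_side = output_size(conv_side, kernel=2, stride=2)

# Hypothetical 4 x 4 feature map and hypothetical scores for four classes.
pooled = max_pool_2d([[1, 3, 2, 0],
                      [4, 2, 1, 5],
                      [0, 1, 9, 2],
                      [3, 2, 4, 6]])
probs = softmax([2.0, 1.0, 0.5, 0.1])

print(conv_side, pool_side)  # 217 108
print(pooled)                # [[4, 5], [3, 9]]
print(probs)
```

With 20 kernels, each of the 20 feature maps would follow the same size arithmetic; the class with the largest SoftMax probability is taken as the prediction.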

3 Experimental investigation

3.1 Experimental setup

To validate the proposed health monitoring technique, a five-story 3D steel frame structure is tested in the i4S experimental laboratory, IIT Mandi (HP), India, as shown in Fig. 2. Each component is made of grade 304 stainless steel. The frame specifications are given in Table 1. The ends of the beams and columns are connected through a C-shaped joint arrangement with bolted connections, as shown in Fig. 3, and the base of the frame is fixed with a C-clamp.

Fig. 2 Experimental setup

Table 1 Physical parameters of the experimental model frame

Fig. 3 Schematic view of beam-to-column connection

To excite the frame over a broad range of frequencies, a horizontal impact is applied at the top of the frame, as shown in Fig. 4, using an impact hammer with a measurement range of \(\pm 2224\,{\text{N}}\), a resonant frequency of \(\ge 22\,\mathrm{kHz}\), and a sensitivity of 2.25 mV/N; the measured impact excitation is depicted in Fig. 6. Piezoelectric accelerometers (model number 7101A-0050, type IETE, measurement range 50 g, frequency range 0.3/0.5–10,000 Hz, sensitivity 100 mV/g, co-variance (g²) 7.73 × 10⁻⁸) are attached to the middle of the beams and columns (Fig. 4). The DewesoftX data acquisition system is used to collect acceleration responses in the time domain. A seven-channel data recording system is used to collect the excitation force and the horizontal acceleration responses. Specifically, the measurements are taken in a horizontal direction perpendicular to the members. The acceleration data were collected for various trials from each sensor. The damage is induced by the complete loosening of bolts at the joints (nodes). The acceleration data were collected for the undamaged (Fig. 5a) and different damaged (Fig. 5b) cases, as listed in Table 2, at a constant sampling frequency of 500.0 Hz.

Fig. 4 Impact hammer location and accelerometer positions

Fig. 5 Acceleration signal for (a) undamaged (und) and (b) damaged (dam3) cases

Fig. 6 Force spectrum of the undamaged case for (a) trial 1 and (b) trial 2, and force spectrum of the damaged (dam1) case for (c) trial 1 and (d) trial 2

Table 2 Details of experimental cases

The FFT of the force shows that its amplitude is almost constant up to 50 Hz. Hence, the force spectrum measured from the hammer is verified to be constant within the analysed frequencies (0–250 Hz). The force spectra for the undamaged (und) and damaged (dam1) cases are shown in Fig. 6a–d. The spectrum is found to be similar for all cases and all trials. To maintain the consistency of the data, the data were normalized with respect to one particular excitation. Therefore, the decay of the force spectrum amplitude is not expected to affect the outcome of the deep learning model.

Using the sensor data, the natural frequencies are identified through an FFT approach. An FE model of the experimental setup has also been developed in the commercial software ABAQUS. The 3D frame model and its mode shapes are shown in Fig. 7. The aim of this modelling is to carry out modal analysis on the numerical replica to enable mode matching. The natural frequencies (Hz) from the experimental and numerical studies for the undamaged case are given in Table 3.
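Identifying a natural frequency from an FFT amounts to locating a peak in the magnitude spectrum of the acceleration record. The Python sketch below shows that idea on a synthetic signal; it is a simplified stand-in, not the procedure of the study. The plain (slow) discrete Fourier transform, the single-peak search, the 5 Hz decaying sine, and the one-second record length are assumptions for the example; only the 500 Hz sampling frequency is taken from the text.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of the one-sided discrete Fourier transform (bins 0 .. N/2)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, value in enumerate(signal):
            acc += value * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest spectral peak, ignoring the DC bin."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k * fs / len(signal)


# Synthetic free-decay response: a lightly damped 5 Hz mode sampled at 500 Hz.
fs = 500.0
response = [math.exp(-0.5 * k / fs) * math.sin(2.0 * math.pi * 5.0 * k / fs)
            for k in range(500)]
first_mode = dominant_frequency(response, fs)
print("identified frequency (Hz):", first_mode)
```

With a 500 Hz sampling frequency the spectrum extends to 250 Hz, and the frequency resolution equals the sampling frequency divided by the number of samples, so longer records resolve closely spaced modes better. A practical implementation would use a fast FFT routine and pick several peaks.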

Fig. 7 3D frame model in ABAQUS and its various mode shapes

Table 3 Comparison of natural frequencies (Hz) for undamaged case

3.2 Temperature variability

In this section, both the localization and severity estimation of damage in the 3D frame structure are carried out under environmental changes using the CNN-based technique described earlier. In this work, temperature changes are considered the source of environmental changes (Sohn 2007).

Some previous studies (Cornwell et al. 1999; Faravelli et al. 2011; Peeters and Roeck 2015) assumed either the material's density or Young's modulus to be the parameter most affected by temperature variations. They observed that when the temperature changes by 3 °C, the frequency of steel tower structures changes by 0.5% on an hourly basis. For bridge-type structures, the frequency variation on a daily or seasonal basis was 5–10%. In the present study, the natural frequencies are observed to change by 0.7% to 2.95% per 10 °C increase in temperature. In this research, a different method (circular shifting) is proposed for creating synthetic data under temperature changes. It should be noted that, owing to the constraints of the lab equipment, tests at different temperatures could not be performed in the lab.

Initially, the time-domain data was transformed into frequency bands (Eq. 3) to determine the natural frequencies of the structure. Then, the first natural frequency was shifted by + 0.50%, −0.50%, + 1%, −1%, + 1.30%, and −1.30%, which may encompass a broad range of temperature changes (± 9 °C). To illustrate the procedure, an example of a + 1.3% frequency shift is shown in Fig. 8. Due to the shifting, some parts of the bands extend beyond 250 Hz; these parts were cut out and placed at the beginning of the bands, as shown in Fig. 9. In this process, there is no variation in energy.

Fig. 8 (a) FFT of actual experimental data and (b) FFT of shifted data

Fig. 9 Frequency shifting procedure

$${Z}_{c}\left(\omega \right)=\sqrt{\frac{2}{\pi }}\int^{\pi}_{0}z\left(t\right){\text{cos}}\left(\omega t\right){\text{d}}t$$
(3)

Subsequently, the shifted frequency bands are transformed back into the time domain by the inverse cosine transform (Eq. 4). The resulting dataset can be taken as a synthetic experimental time-domain response of the structure at various temperatures. In the same way, datasets were created for the other shifts mentioned in the previous paragraph.

$$z\left(t\right)=\sqrt{\frac{2}{\pi }}\int^{\pi}_{0}{Z}_{c}\left(\omega \right){\text{cos}}\left(\omega t\right){\text{d}}\omega ,\quad t\ge 0$$
(4)
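The circular shifting procedure can be illustrated with a discrete analogue of the cosine transform pair. The Python sketch below is a minimal example under stated assumptions rather than the implementation used in the study: the orthonormal DCT-II and its inverse stand in for Eqs. 3 and 4, the shift is expressed as a whole number of frequency bins (the hypothetical parameter `bins`), and the test signal is a single synthetic cosine. Because the transform is orthonormal and the shift only permutes coefficients, the signal energy is unchanged, in line with the statement that the process causes no variation in energy.

```python
import math


def dct(signal):
    """Orthonormal DCT-II, a discrete counterpart of the forward cosine transform."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(signal[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out


def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III), a counterpart of the inverse transform."""
    n = len(coeffs)
    out = []
    for t in range(n):
        s = 0.0
        for k in range(n):
            weight = math.sqrt((1.0 if k == 0 else 2.0) / n)
            s += weight * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def circular_shift(coeffs, bins):
    """Move every coefficient up by `bins`; the overflow wraps to the start."""
    bins %= len(coeffs)
    return coeffs[-bins:] + coeffs[:-bins] if bins else list(coeffs)


# Synthetic signal whose spectrum has a single peak at bin 10.
n, k0 = 128, 10
signal = [math.cos(math.pi * (t + 0.5) * k0 / n) for t in range(n)]

spectrum = dct(signal)
shifted_signal = idct(circular_shift(spectrum, 2))  # hypothetical 2-bin shift
shifted_spectrum = dct(shifted_signal)

peak_before = max(range(n), key=lambda k: abs(spectrum[k]))
peak_after = max(range(n), key=lambda k: abs(shifted_spectrum[k]))
energy_before = sum(v * v for v in signal)
energy_after = sum(v * v for v in shifted_signal)
print(peak_before, peak_after)  # 10 12
print(energy_before, energy_after)
```

In this discrete setting, bin k corresponds to a frequency of k·fs/(2N) for N samples at sampling frequency fs, so a percentage shift of a natural frequency would be converted to the nearest whole number of bins; a record long enough to resolve shifts of about 1% is assumed.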

As shown in Fig. 10a, the original spectrum has a first natural frequency amplitude of 5.33075 Hz. When the spectrum is shifted by 1.3%, the first natural frequency amplitude increases to 5.4519 Hz, which is nearly equal to a 1.3% shift of the original spectrum. This indicates that when the spectrum is shifted to the right, its amplitudes increase. The comparison of the original and shifted signals is shown in Fig. 10b, which helps visualize the differences between the original (time-domain) signal (blue) and the shifted signal (red). Moreover, the scalogram image in Fig. 11b shows that the shifting operation increases the amplitude of the frequency peaks and intensifies the colours compared with Fig. 11a.

Fig. 10 (a) Original and shifted spectrum (time and frequency domain) and (b) comparison of original and shifted signals (time domain)

Fig. 11 Scalogram images of (a) original and (b) shifted spectrum

3.3 Damage location and severity

After the dataset was created for the different temperature levels, the datasets for the damaged and undamaged cases were separated, as given in Table 2. The original experimental data was then used to create the training images, as previously mentioned, and the image dataset created from the synthetic experimental data was used to test the model. The localization and severity of the damage were determined using the same classification approach explained earlier. In this work, one undamaged case and three different damaged cases were each treated as a separate class, and each class represents a particular location and severity of damage. The outcome of the study is reported in the results and discussions section.

4 Results and discussions

For the various configurations of the structure, the experimental acceleration data was taken from the seven accelerometers. The sensors are placed in the middle of the beams and columns, as shown in Fig. 4. The acceleration signals for the undamaged (und) and damaged (dam3) cases are depicted in Fig. 5a, b. Likewise, the acceleration responses are acquired for four cases: und, dam1, dam2, and dam3. In all cases, time-domain acceleration signals are recorded from all 7 sensors for 20 trials and converted to time–frequency scalogram images by performing a continuous wavelet transform in MATLAB, as shown in Fig. 12a, b. The scalogram provides both time and frequency information and thus serves as an extensive visual representation of time- and frequency-based features, whereas the time-history data contains only time-response information. For this research, scalogram images are used because the time–frequency image has more features than the raw time-history data (Han et al. 2018; Sharma et al. 2018). For each experimental trial, 7 scalogram images are produced, and for each configuration, 7 × 20 = 140 scalograms are obtained. The size of the colour image is [876 × 656 × 3] (length × height × number of channels). To reduce the computational effort, the image is resized to [224 × 224 × 3] pixels using the imresize function in MATLAB.

Fig. 12 Scalogram images of (a) undamaged (und) and (b) damaged (dam3) cases

Additionally, as shown in Fig. 13, image augmentation is performed by adding Gaussian noise with zero mean and different variances to the reduced images using the imnoise function in MATLAB. The Gaussian noise variance can take values between 0.01 and 1 (Shijie et al. 2017; Wang et al. 2016). In this study, the variances used are 0.0001, 0.001, and 0.01.

Fig. 13 Scalogram image development, reduced image, and augmented image procedure

The study used a typical uniform random noise approach to add noise to the original resized image dataset (\(y\)). The creation of a new dataset (\(\overline{y }\)) from the original dataset is defined mathematically by Eq. 5.

$$\overline{y }=y+{\text{RND}} (0.0001, 0.01)\times {{\text{Noise}}}_{i}$$
(5)

Here, RND is a uniform random number in the range (0.0001, 0.01) that scales the model parameter \({{\text{Noise}}}_{i}\); together they provide an adjustable amount of noise introduced during the data augmentation procedure (Moreno-Barea et al. 2018).
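One possible reading of Eq. 5 is sketched below in Python. It is an interpretation, not the MATLAB imnoise call used in the study: the noise term is assumed to be a zero-mean, unit-variance Gaussian sample drawn independently for each pixel, one uniform level is drawn per augmented image, pixel intensities are assumed to lie in [0, 1] and are clipped to that range, and the tiny 4 × 4 grey image is a placeholder for a resized scalogram.

```python
import random


def augment(image, rng, low=0.0001, high=0.01):
    """Return a noisy copy of `image` and the noise level that was drawn.

    Each pixel y becomes y + level * noise, where `level` is uniform in
    [low, high] and `noise` is an assumed standard Gaussian sample.
    """
    level = rng.uniform(low, high)
    noisy = [[min(1.0, max(0.0, px + level * rng.gauss(0.0, 1.0)))
              for px in row] for row in image]
    return noisy, level


rng = random.Random(0)                 # fixed seed so the example is repeatable
image = [[0.5] * 4 for _ in range(4)]  # placeholder for a resized scalogram

# Three augmented copies of the same image, each with its own noise level.
copies = [augment(image, rng) for _ in range(3)]
for noisy, level in copies:
    print(round(level, 5), [round(px, 3) for px in noisy[0]])
```

Drawing a fresh level for every copy gives augmented images that differ slightly from one another while keeping the overall time–frequency pattern intact.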

In the present work, of the 7 sensors (20 trials each), the raw images from 6 sensors are used for training and validation, and the raw images from 1 sensor are used for testing the CNN model (the positions of the sensors are shown in Fig. 4). Further, image augmentation is carried out by adding zero-mean Gaussian noise, as given in Table 4. Within each individual case, slight variations are observed in the data due to various disturbances during the experiment. However, when the structure is in a different condition, the data vary significantly compared with the other conditions, producing distinct sets of data. Therefore, to train the CNN model, each case was repeated 20 times, which captures the small variations in the data within an individual case and the large variations among the cases.

Table 4 Distribution of image dataset for training, validation, and testing

The training of the model is stopped after 14,400 iterations with 80 epochs, and the validation accuracy of the model is found to be 94.38%, as shown in Fig. 14.

Fig. 14 Graph of the accuracy and loss of the CNN model

A tenfold cross-validation test is also used to obtain a confidence level for the accuracy of the model, and the average results for each class are presented in the confusion matrix (CM) in Table 5. The testing accuracy is computed as the mean of the diagonal elements, (0.823 + 0.818 + 0.973 + 0.99) × 100/4 = 90.1%. From the results, it is observed that the developed CNN-based SHM technique can differentiate between undamaged and different damaged cases with a testing accuracy of 90.1%. The diagonal elements of the CM (bold) represent the proportion of accurately classified cases for each class.

Table 5 CM for average tenfold classification testing results

The classes dam1, dam2, dam3, and und are labelled in this matrix as actual cases (rows) and predictions (columns). The values in the CM are normalised to represent the proportion of predictions for each actual class. Intersection over Union (IoU) is a metric often employed in image classification. It compares the sets of true and predicted labels of a class and is given by:

$${\text{IoU}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}+{\text{FN}}}$$
(6)

where TP (true positives) is the number of correct predictions for a class, FP (false positives) is the number of instances incorrectly predicted as that class, and FN (false negatives) is the number of instances that belong to the class but are predicted as a different class.

Consider the dam1 class of the given matrix (Table 5) as an example:

  • TP for dam1 = 0.823 (diagonal element of dam1).

  • FP for dam1 = sum of the dam1 column (0.823 + 0.171 + 0.011 + 0) − TP of dam1 (0.823) = 0.182.

  • FN for dam1 = sum of the dam1 row (0.823 + 0.108 + 0.064 + 0.005) − TP of dam1 (0.823) = 0.177.

    $$\mathrm{IoU}\,\mathrm{score}\,\mathrm{for}\,\mathrm{dam}1=\frac{0.823}{0.823+0.182+0.177}=0.696$$

Similarly, the IoU scores for the other classes are obtained as follows.

  • IoU scores for dam2 = 0.728.

  • IoU scores for dam3 = 0.901.

  • IoU scores for und = 0.980

    $$\mathrm{Mean}\,\mathrm{IoU}\,\mathrm{of}\,\mathrm{all}\,\mathrm{classes}=\frac{\left(0.696+0.728+0.901+0.980\right)}{4}=0.826$$
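The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and only numbers quoted in the text are used: the dam1 row and column of Table 5, the four diagonal elements, and the four reported IoU scores. The rest of the confusion matrix is not reproduced here.

```python
def iou(tp, fp, fn):
    """Intersection over Union for one class, as in Eq. 6."""
    return tp / (tp + fp + fn)


def iou_from_row_and_column(row, column, index):
    """IoU of the class at `index`, given its CM row (actual) and column (predicted)."""
    tp = row[index]
    fn = sum(row) - tp     # actual members of the class predicted as another class
    fp = sum(column) - tp  # members of other classes predicted as this class
    return iou(tp, fp, fn)


# dam1 row and dam1 column as quoted in the worked example.
dam1_row = [0.823, 0.108, 0.064, 0.005]
dam1_column = [0.823, 0.171, 0.011, 0.0]
dam1_iou = iou_from_row_and_column(dam1_row, dam1_column, 0)

# Reported per-class IoU scores (dam1, dam2, dam3, und) and diagonal elements.
class_iou = [0.696, 0.728, 0.901, 0.980]
mean_iou = sum(class_iou) / len(class_iou)
testing_accuracy = (0.823 + 0.818 + 0.973 + 0.99) * 100 / 4

print(round(dam1_iou, 3))          # 0.696
print(mean_iou)                    # approximately 0.826
print(round(testing_accuracy, 1))  # 90.1
```

Because the IoU penalises both false positives and false negatives, it is lower than the per-class accuracy on the diagonal (0.696 against 0.823 for dam1).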

Further, the robustness of the developed CNN-based technique is examined by identifying the location and severity of the damage under temperature variability. For that purpose, four classes (und, dam1, dam2, and dam3) are considered. As explained in Sect. 3, six different temperature changes were considered, and the same network was tested under these temperature variations (covering up to ± 9 °C). The average tenfold classification testing results are given in Table 6, and the testing accuracy is 82.8%.

Table 6 CM for average tenfold classification testing results under temperature variability

Additionally, to check the effectiveness of the developed CNN-based technique, the floor level of the joint damage is identified. For that purpose, four classes (und2, dam4, dam5, and dam6) are considered, as given in Table 2. To meet this objective, the scalogram images of the four cases are fed into the pre-trained model. The average testing results are given in Table 7, and the testing accuracy is 87.8%. This implies that if a damage case is near a particular training class, it can be classified according to its closest class. This clearly shows that the technique can detect the floor level of joint damage even with the data for which the network has not been trained.

Table 7 CM for average tenfold classification testing results of the pre-trained model

Furthermore, a comparative study is performed by comparing the results of the CNN model with those of the AlexNet model. The AlexNet model provides better classification accuracy than the CNN model. A complete overview of the AlexNet model is presented in a study (Amanollah et al. 2023). The performance of the AlexNet model in the classification process is shown in Fig. 15.

Fig. 15 Graph of the accuracy and loss of the AlexNet model

The training of the model is stopped after 1680 iterations with 20 epochs, and the training and validation accuracies of the model are found to be 95.63% and 95.33%, respectively, as shown in Fig. 15a. The average testing results of each class are presented in the confusion matrix (CM) in Table 8.

Table 8 CM for average tenfold classification testing results of AlexNet model

From the results, it is observed that the AlexNet model can differentiate between undamaged and different damaged cases with a testing accuracy of 94.375%.

The robustness of the AlexNet model is examined by identifying the location and severity of the damage under temperature variability with a testing accuracy of 90.25% as given in Table 9.

Table 9 CM for average tenfold classification testing results under temperature variability

To check the effectiveness of the AlexNet model, the floor level of the joint damage is identified with a testing accuracy of 92.5% as given in Table 10.

Table 10 CM for average tenfold classification testing results of the pre-trained model

5 Conclusions

In the present study, a health monitoring technique for joint damage in a 3D frame structure using CNN is developed. For that purpose, a five-story 3D steel building frame is considered. The robustness and effectiveness of the technique for the detection, localization, and severity of damage were examined for different temperature conditions and with the unseen data collected from the different joints at the same floor level.

  • The average training and validation accuracies are found to be 100% and 94.38%, respectively, whereas the testing accuracy is 90.1%, which indicates that the technique can differentiate between undamaged and damaged cases.

  • The study considers the shifting of natural frequencies (+ 0.50%, −0.50%, + 1%, −1%, + 1.30%, and −1.30%) as the effect of temperature changes, which may cover variations of ± 9 °C. The results show that with these variations, the technique can classify the cases with 82.8% accuracy.

  • The floor level of the joint damage was also successfully identified with an average testing accuracy of 87.8% using unseen images that were not used for training.

  • The average training and validation accuracies of the AlexNet model are found to be 95.63% and 95.33%, respectively, whereas the testing accuracy is 94.37%, which indicates that the technique can differentiate between undamaged and damaged cases.

  • The testing accuracy of the AlexNet model under consideration of temperature variation is found to be 90.25%, whereas the floor level of the joint damage was identified with a testing accuracy of 92.25%.

  • The IoU scores for every class show a strong classification performance of the model between the classes while considering false positives and false negatives. Based on their IoU scores, it is observed that ‘dam3’ and ‘und’ classes show high separability and low confusion with other classes.

  • The results indicate that the proposed technique has the potential for the development of an industry-grade automation tool for the SHM of connections in 3D frame structures and will significantly contribute to the field of research.

  • The study further emphasizes that before the technique is generalized, the effects of the operational and environmental variables must be validated with real-structure data, which is ongoing research by the authors.