
18.1 Introduction

In an effort to ensure the safety and integrity of operating civil, mechanical, and aerospace structures and to prevent catastrophic failures, various structural health monitoring (SHM) techniques have been developed over the past few decades [1,2,3,4]. Among these methods, vibration-based SHM techniques have been recognized as some of the most important and effective approaches for the health assessment of engineering structures. In experimental studies that use these vibration-based approaches, sensors are attached at selected locations on a structure to measure the vibration response, and the measured data are then post-processed to extract information regarding the structural dynamics (i.e., natural frequencies, mode shapes, damping, etc.). However, while many of these approaches have proven effective in particular scenarios, their performance on large-scale structures is still limited [5,6,7]. Key reasons for this limitation are the sparsity of measurements and the inability to resolve anomalous responses between measurement locations.

As an alternative to sparse, single-point sensing, full-field measurements enable a continuous description of the structural response and thereby establish the critical linkage between global and local behavior [8,9,10,11,12]. Although full-field vibration measurements could in principle be achieved with a large array of accelerometers or with a roving measurement scheme, a large array would add significant extra mass to certain engineering structures and would likely be cost-prohibitive, whereas a roving scheme does not provide synchronous measurements and requires a significant amount of data acquisition time.

Optical sensing techniques such as scanning laser Doppler vibrometry, digital speckle shearography, and electronic speckle pattern interferometry are desirable for full-field, noncontact measurements and have the potential to overcome the measurement sparsity that exists in traditional sensing [13, 14]. However, scanning laser Doppler vibrometers measure structural vibration sequentially, point by point, and are vulnerable to rigid body motion and asynchronism [15]. Recently, continuously scanning laser Doppler vibrometers have been developed to address the lengthy acquisition time of this technique, but these instruments are very expensive, can only measure vibration along a single line of the structure, and produce results that depend on the scanning frequency of that line. Both digital speckle shearography and electronic speckle pattern interferometry are limited to single-axis measurement and capture only relative displacements, making them ineffective for large-scale SHM applications [16]. Although these techniques have specific limitations for SHM, they do shed light on the potential of, and requirements for, a feasible full-field noncontact measurement system. The ideal solution for robust vibration-based SHM is ultimately a noninvasive technique capable of providing 3D quantitative measurements of vibration responses over a large area, eliminating the aforementioned disadvantages while maintaining the critical global-local relationship.

Advances in imaging sensors, such as high-resolution high-speed cameras, and in processing techniques, such as digital image correlation [13, 17] and phase-based motion magnification [18], have created a path toward measuring the full-field structural vibration response, an approach that is potentially transformative for SHM applications. Currently, the deployment of image-based characterization of structural systems is in its nascent stages, and the use of image-based vibration techniques for modal analysis via machine learning is still under investigation. The current research focuses on the formulation of a novel end-to-end deep learning framework for modal identification in engineering structures as a noncontact, full-field approach. These activities are transformative and have the potential to facilitate completely new SHM techniques with advantages including low cost, full-field sensing, high resolution, zero mass loading, immunity to electromagnetic interference, and applicability to a wide range of civil, mechanical, and aerospace engineering structures.

A typical 2D CNN comprises multiple convolution, activation, and pooling layers followed by fully connected and classification (or regression) layers. An overview of the CNN is presented in Fig. 18.1a. The first part of the network consists of a number of convolution layers, each followed by a rectified linear unit (ReLU) activation function; a pooling layer (average pooling or max pooling) may also follow the convolution layers. The second part of the network resembles a multilayer feedforward neural network (MLFFNN) with fully connected hidden layers followed by a regression or classification layer; the primary differences between MLFFNNs and CNNs are the convolution and pooling layers. The input layer is a 2D matrix (or, equivalently, a 2D image) that may have multiple components in each entry, such as an RGB triplet, and it accepts a fixed-size grid of I×J pixels as shown in Fig. 18.1a.

Convolution layers extract spatial features from input images. A convolution layer consists of a set of small receptive fields, known as filters or kernels, which operate on equal-sized subarrays of the input data. The kernel weights are typically initialized with random values and updated during training by an optimizer such as Adam or stochastic gradient descent (SGD); they may also be initialized with pretrained values, in a process known as fine-tuning or transfer learning. For each subarray, a dot product is computed between the kernel and the subarray, the products are summed, and a bias value is added. A diagram of a sample convolution operation is shown in Fig. 18.1b: the center element of the filter/kernel is placed over the source pixel, which is then replaced with a weighted sum of itself and its neighboring pixels. To operate on the entire input area, the kernel is moved over the input data; the number of indices moved for each new calculation is known as the stride. A smaller stride increases the number of features extracted but is also more computationally expensive. Pooling layers are used to downsample the data and decrease the computational cost of the network: each subarray of the data is condensed into a single value using either max or mean pooling. In max pooling, the greatest value in the subarray becomes the new value, while in mean pooling, the values of the subarray are averaged. As with the convolution layer, the stride of a pooling layer determines how many indices to shift for the next pooling operation on a new subarray [19,20,21,22,23].

Fig. 18.1

Overview of 2D CNNs. (a) Illustration of a CNN with its components, including input layer, convolution layers, max pooling, etc. (b) Illustration of the convolution operation, in which the input is a grayscale image after blurring and downsampling
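To make the convolution, ReLU, and pooling operations described above concrete, the following is a minimal NumPy sketch (not taken from the paper) of a single convolution layer with bias, ReLU activation, and max pooling. The function names and the edge-detection kernel are illustrative only; as in most CNN libraries, the "convolution" is implemented as cross-correlation, matching the weighted-sum description above.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0, stride=1):
    """Valid cross-correlation: dot product of the kernel with each subarray, plus a bias."""
    ki, kj = kernel.shape
    out_i = (image.shape[0] - ki) // stride + 1
    out_j = (image.shape[1] - kj) // stride + 1
    out = np.empty((out_i, out_j))
    for i in range(out_i):
        for j in range(out_j):
            patch = image[i * stride:i * stride + ki, j * stride:j * stride + kj]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

def relu(x):
    """Rectified linear unit activation."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2, stride=2):
    """Downsample by keeping the greatest value in each subarray."""
    out_i = (x.shape[0] - size) // stride + 1
    out_j = (x.shape[1] - size) // stride + 1
    out = np.empty((out_i, out_j))
    for i in range(out_i):
        for j in range(out_j):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

# Toy usage: an 8x8 "image" filtered by a 3x3 edge-detection kernel
img = np.random.rand(8, 8)
kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
features = max_pool(relu(conv2d(img, kernel, bias=0.1, stride=1)))
print(features.shape)  # (3, 3): 8x8 -> 6x6 after convolution -> 3x3 after pooling
```

A smaller stride in `conv2d` would produce a denser feature map at a higher computational cost, exactly the trade-off noted above.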

18.2 Research Perspectives

This study focuses on a computer vision-based approach that leverages machine learning to identify the dynamic properties of engineering structures. In particular, the goal of this paper is to formulate and validate a noncontact, image-based SHM framework for engineering structures using a deep learning strategy for modal frequency identification. The outcome is a technique that allows for an automated, noninvasive, image-based approach to SHM, a strategy that would enable true in situ assessment of engineering structures.

As one of the latest advancements in deep learning, attention mechanisms have been widely used in machine translation, image captioning, and many other applications. In learning-based video analytics, attention mechanisms also show great power, either by temporally focusing on different frames of a video or by spatially focusing on different parts of a frame. In this work, an algorithm that acquires spatiotemporal attention from images to detect motion is proposed, and the decomposition filter is learned directly from examples using a deep learning-based strategy named DeepDyn. The proposed idea not only optimizes the filter design for video processing in a fully automated way, without any human intervention, but also has the capability of generalizing and transferring the learned information to more complex structures. For the structural vibration videos demonstrated in this work, frequency is one of the features that the DeepDyn model can identify. After experiments with various architectures and hyperparameters, the final model consists of a GoogLeNet-like network with two Inception modules for the CNN part, spatio-feature channel attention (SFCA) blocks to attend to key spatial filters, ConvLSTM layers with 32 hidden units each, and a temporal attention layer to focus on key frames. At the end of the ConvLSTM layers, temporal feature attention (TFA) learns to focus on crucial temporal features belonging to different frames. Overall, this network architecture helps learn an objective function for data with spatiotemporal correlations (Fig. 18.2); an illustrative code-level sketch of the pipeline is given after the figure.

Fig. 18.2

Comprehensive details of the proposed end-to-end deep learning framework to extract structural resonance frequencies
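The exact layer definitions of the SFCA and TFA blocks are not specified in this text, so the following PyTorch sketch should be read as an interpretation rather than the authors' implementation: a squeeze-and-excitation-style channel gate stands in for SFCA, a softmax score over frames stands in for TFA, and a plain two-layer convolution stack replaces the two Inception modules for brevity. Only the ConvLSTM width (32 hidden units) is taken from the text.

```python
import torch
from torch import nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel gate; a stand-in for the SFCA block."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(),
            nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):                          # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))            # per-channel weights in [0, 1]
        return x * w[:, :, None, None]

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the four LSTM gates are computed with one convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class TemporalAttention(nn.Module):
    """Softmax weighting over frames; a stand-in for the TFA layer."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)
    def forward(self, seq):                        # seq: (B, T, D)
        a = torch.softmax(self.score(seq).squeeze(-1), dim=1)
        return (a.unsqueeze(-1) * seq).sum(dim=1)  # attention-weighted summary, (B, D)

class DeepDynSketch(nn.Module):
    """Per-frame CNN + channel attention -> ConvLSTM -> temporal attention -> frequency."""
    def __init__(self, hid=32, n_freq=1):
        super().__init__()
        self.cnn = nn.Sequential(                  # plain conv stack in place of Inception modules
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.sfca = ChannelAttention(32)
        self.cell = ConvLSTMCell(32, hid)          # 32 hidden units, as in the text
        self.tfa = TemporalAttention(hid)
        self.head = nn.Linear(hid, n_freq)         # regression to resonance frequency
    def forward(self, video):                      # video: (B, T, 1, H, W)
        B, T = video.shape[:2]
        feats = [self.sfca(self.cnn(video[:, t])) for t in range(T)]
        h = feats[0].new_zeros(B, self.cell.hid_ch, *feats[0].shape[2:])
        c = torch.zeros_like(h)
        hs = []
        for x in feats:
            h, c = self.cell(x, h, c)
            hs.append(h.mean(dim=(2, 3)))          # global-average-pool each hidden map
        return self.head(self.tfa(torch.stack(hs, dim=1)))

# Toy forward pass: 2 clips of 8 grayscale 32x32 frames -> one frequency each
print(DeepDynSketch()(torch.randn(2, 8, 1, 32, 32)).shape)  # torch.Size([2, 1])
```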

Fig. 18.3

(a) Comparison of validation loss curves for the different modules in the network. (b) MAE values over the validation set. (c) Regression plot demonstrating the robustness and generalizability of the DeepDyn model

18.3 Performance Metrics

18.3.1 Evaluation Metric and Generalizability

The proposed deep learning network shown in Fig. 18.2 is trained on the existing base generic dataset described in the previous section [24]. Generalizability refers to the ability of a model to perform well on unseen datasets within its training input domain. In this paper, the mean absolute error (MAE) is adopted as the evaluation criterion for the proposed deep learning architecture and also serves as the loss function for all architectures during training. MAE quantifies the absolute error between the predicted and ground truth values and is expressed by Eq. (18.1):

$$ \mathrm{MAE}=\frac{\sum_{i=1}^n\left|{\hat{y}}_i-{y}_i\right|}{n} $$
(18.1)

where \( \hat{y}_i \) is the predicted value, \( y_i \) is the ground truth value, and n is the sample size of the dataset. Here, the MAE between the predicted and true natural frequencies for six different beams is used to evaluate model performance. The Adam optimizer is employed to train the networks given its strong performance relative to other stochastic optimization methods. The learning rate is set to 0.0001, and the total number of training epochs is set to 400. The validation dataset helps monitor overfitting during training, and the trained network with the lowest validation loss is saved for fine-tuning on a different dataset (prediction/testing). To demonstrate the effectiveness of the different modules, the validation loss curves are compared in Fig. 18.3a; the attention blocks clearly improve the validation accuracy. Figure 18.3b outlines the MAE values of the developed model, with mean MAE ranging between 0.45 and 1.5, which attests to the strong performance of the model. Figure 18.3c shows the model's performance over the test dataset; these figures demonstrate that the predictions closely trace the ground truth. Among the different network configurations, the full DeepDyn model has the best prediction results, with a mean MAE of 0.4, while the plain CNN performs worst, with a mean MAE of 70.
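As a concrete illustration of this training setup, the sketch below (not from the paper) uses PyTorch's `nn.L1Loss`, which computes exactly the MAE of Eq. (18.1), with Adam at the stated learning rate and epoch count. The tiny stand-in model and random tensors are placeholders for the DeepDyn network and the video-derived data; the checkpointing mirrors the save-the-lowest-validation-loss rule described above.

```python
import torch
from torch import nn

# Stand-in regression model and synthetic data; shapes are illustrative only.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
x_train, y_train = torch.randn(128, 64), torch.randn(128, 1)  # features, frequencies (Hz)
x_val, y_val = torch.randn(32, 64), torch.randn(32, 1)

criterion = nn.L1Loss()                                        # L1 loss == MAE, Eq. (18.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # learning rate from the text

best_val, best_state = float("inf"), None
for epoch in range(400):                                       # 400 epochs, as in the text
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()
    if val_loss < best_val:                                    # keep the lowest-validation-loss checkpoint
        best_val = val_loss
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)                              # network saved for later fine-tuning
```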

18.3.2 Extrapolability and Transfer Learning

After testing on the generic dataset available in [24], the extrapolability of the trained network is investigated using videos from other laboratory experiments (i.e., with different vibrating objects). Here, videos recording the vibration of a turbine structure [25] are chosen for the validation study. Note that inference with the trained network is performed to extract modal frequencies from these new testing videos, for which transfer learning is required for fine-tuning. To verify the extrapolability of the proposed model, a transfer learning strategy is executed to show the capability of DeepDyn to generalize to datasets other than the one used for training. The video data of a wind turbine [25, 26] are used to verify the extrapolability of the trained network for extracting the resonance frequencies. To implement this strategy, the following steps are established. First, the model is trained on the generic dataset [24] for the first three modal frequencies. Second, transfer learning is conducted to fine-tune the model on the new dataset for the first nine modal frequencies. Third, the fine-tuned model is tested on the first generic dataset to obtain the remaining sets of frequencies (a sketch of the fine-tuning step is given after Fig. 18.4). It can be observed that the trained network successfully captures the modal properties and, in general, predicts them close to the ground truth, as presented in Fig. 18.4. This illustrative case, as well as the previous cases, demonstrates that the trained DeepDyn network is transferable and generalizable for the extraction of modal frequencies from unseen videos.

Fig. 18.4

Results of the DeepDyn model (transferred using front-view video frames and tested on top-view video frames) versus ground truth natural frequencies (Hz) for the turbine structure
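The fine-tuning step referenced above might look like the following PyTorch sketch, under the assumption that transfer learning here means freezing the pretrained layers and retraining a new regression head (three outputs expanded to nine). The checkpoint filename, layer sizes, and fine-tuning schedule are hypothetical.

```python
import torch
from torch import nn

# Same stand-in architecture as before, with a 3-output head
# (first three modal frequencies, trained on the generic dataset).
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 3))
model.load_state_dict(torch.load("deepdyn_generic.pt"))  # hypothetical checkpoint name

for p in model.parameters():        # freeze the pretrained feature layers
    p.requires_grad = False

model[3] = nn.Linear(32, 9)         # fresh head for the first nine modal frequencies
optimizer = torch.optim.Adam(model[3].parameters(), lr=1e-4)
criterion = nn.L1Loss()             # MAE, as in the original training

# Fine-tune on the turbine videos (placeholder tensors stand in for the real data)
x_new, y_new = torch.randn(64, 64), torch.randn(64, 9)
for epoch in range(50):             # short fine-tuning schedule (assumed)
    optimizer.zero_grad()
    loss = criterion(model(x_new), y_new)
    loss.backward()
    optimizer.step()
```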

18.4 Conclusion

This paper proposes a novel attention-based neural network model, named DeepDyn, to fully mine the spatiotemporal information of images. The natural frequencies of structures are extracted from dissimilar structures and extrapolated to higher modes even when training data are not available. In this model, a CNN and attention mechanisms are used to extract the most relevant spatiotemporal visual features from the sequences of images, and the attention mechanism is further combined with ConvLSTM to mine the temporal correlations among the spatial visual features. Experimental results indicate that the proposed method achieves better performance than most existing image-based modal analysis methods and confirm the effectiveness and advantages of video-based identification of system dynamics. The results also demonstrate that, compared with a general CNN-RNN model, the proposed architecture can effectively predict modal information from the image sequences in the dataset. Overall, the trained networks have the potential to enable real-time monitoring of structural vibration. Future work will focus on the application and validation of this technique on in situ structures, while enriching the training datasets to enhance the inference ability of the trained network.