Abstract
Current video-based vibration modal analysis approaches typically decompose video frames into representations that allow motions to be magnified, and then use these motion representations for vibration modal analysis. The decomposition usually relies on handcrafted kernels, such as complex steerable kernels, which may not be optimal for extracting subtle motions, especially at higher frequencies. In this paper, the decomposition kernels are instead learned directly from a baseline image dataset using deep convolutional neural network (CNN) models. Every subpixel of an image obtained from a digital camera contributes to the spatiotemporal information and thus serves, in effect, as an individual motion sensor for acquiring the modal frequencies of a vibrating structure. A hybrid-attention-LSTM-based deep convolutional neural network architecture is developed that exploits attention and LSTM blocks to discover subtle motions from a specific source and to visualize the dynamic properties of structures at high resolution in the presence of substantial noise. Transfer learning is used to carry knowledge learned on the baseline dataset over to new, limited datasets and to avoid underfitting during training, since the currently available public modal-frequency datasets are insufficient to train a generalized network. The proposed architecture is designed so that a model trained on a baseline dataset of a simple structure can be transferred to a complicated structure. After training, the model takes the video of a vibrating structure as input and outputs the fundamental modal frequencies. Empirical results show that the proposed model is autonomous, efficient, and accurate.
18.1 Introduction
In an effort to ensure the safety and integrity of operating civil, mechanical, and aerospace structures, and to prevent catastrophic failures, various SHM techniques have been developed over the past few decades [1,2,3,4]. Among these methods, vibration-based SHM has been recognized as one of the most important and effective approaches for the health assessment of engineering structures. In experimental studies that use these vibration-based approaches, sensors are attached at selected locations on a structure to measure the vibration response, and the measured data are then post-processed to extract information regarding the structural dynamics (i.e., natural frequencies, mode shapes, damping, etc.). However, while many of these approaches have proven effective in particular scenarios, their performance on large-scale structures is still limited [5,6,7]. Key reasons for this limitation are the sparsity of measurements and the inability to resolve anomalous responses between measurement locations.
As an alternative to sparse single-point sensing, full-field measurements enable a continuous description of the structural response and realize the critical linkage between global and local behavior [8,9,10,11,12]. Although full-field vibration measurements could be achieved using a large array of accelerometers or a roving measurement scheme, a large array would add significant extra mass to certain engineering structures and would likely be cost-prohibitive, whereas a roving sensing scheme does not provide synchronous measurements and requires a significant amount of data acquisition time.
Optical sensing techniques such as scanning laser Doppler vibrometry, digital speckle shearography, and electronic speckle pattern interferometry are desirable for full-field, noncontact measurements and have the potential to overcome the measurement sparsity of traditional sensing [13, 14]. However, scanning laser Doppler vibrometers are limited to sequentially measuring structural vibration point by point and are vulnerable to rigid body motion and asynchronism [15]. Recently, expensive continuously scanning laser Doppler vibrometers have been developed to address the lengthy acquisition time of this technique, but they can only measure vibration along a line of the structure, and the results depend on the scanning frequency of the line. Both digital speckle shearography and electronic speckle pattern interferometry are limited to single-axis measurements and can only capture relative displacements, and are thus not effective for large-scale SHM applications [16]. Although these techniques have specific limitations for SHM, they do shed light on the potential of, and requirements for, a feasible full-field noncontact measurement system. The ideal solution for robust vibration-based SHM is ultimately a noninvasive system capable of providing 3D quantitative measurements of vibration responses over a large area, thus eliminating the aforementioned disadvantages while maintaining the critical global-local relationship.
Advances in imaging sensors such as high-speed cameras with high resolution and processing techniques such as digital image correlation [13, 17] and phase-based motion magnification [18] have created a path for enabling the measurement of full-field structural vibration response, an approach that is potentially transformative for SHM applications. Currently, the deployment of image-based characterization of structural systems is in the nascent stages, and the use of image-based vibration techniques for modal analysis via machine learning is still under investigation. The current research focuses on the formulation of a novel end-to-end deep learning framework for modal identification in engineering structures as a noncontact, full-field approach. These activities are transformative and have the potential to facilitate the use of completely new SHM techniques with advantages including low cost, full-field sensing, high resolution, zero mass loading, immunity to electromagnetic interference, and applicability to a wide range of civil, mechanical, and aerospace engineering structures.
A 2D CNN fundamentally comprises multiple convolution, activation, and pooling layers followed by fully connected and classification layers. An overview of the CNN is presented in Fig. 18.1a. The first part of the network consists of a number of convolution layers. Each convolution layer is followed by an activation function, typically the rectified linear unit (ReLU). Additionally, a pooling layer (average pooling or max pooling) may follow the convolution layers. The second part of the network is similar to a multilayer feedforward neural network, with fully connected hidden layers followed by a regression or classification layer. The primary differences between multilayer feedforward neural networks (MLFFNNs) and convolutional neural networks (CNNs) are the convolution and pooling layers. The input layer is a 2D matrix (or a 2D image), which may have multiple components in each entry, such as an RGB triplet. The input layer therefore accepts a fixed-size grid of I×J pixels as shown in Fig. 18.1a. Convolution layers extract spatial features from input images; a convolution layer consists of a set of small receptive fields known as filters or kernels, which operate on subarrays of equal size from the input data. The weight values of the kernels are typically initialized with random values and updated during training by an optimizer such as Adam or stochastic gradient descent (SGD). These weights may also be initialized with pretrained values in a process known as fine-tuning or transfer learning. A dot product is computed between the kernels and the subarrays of the input data; the multiplied values are summed, and a bias value is added. A diagram of a sample convolution process is shown in Fig. 18.1b; in the figure, the center element of the filter/kernel is placed over the source pixel, which is then replaced with a weighted sum of itself and nearby pixels.
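The dot-product operation described above can be sketched as a minimal "valid" 2D convolution in plain NumPy (the image, kernel values, and function name are illustrative, not part of the proposed network):

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image; at each position, take the
    elementwise product with the underlying subarray, sum, and add a bias."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel) + bias
    return out

# Illustrative 3x3 Laplacian-like kernel applied to a 5x5 linear-ramp image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

On this linear-ramp image the kernel responds with zeros everywhere, since each pixel equals the average of its four neighbors; on real video frames the same operation highlights edges and local intensity changes.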
To operate on the entire input area, the kernels are moved over the input data. The number of indices moved for each new calculation is known as the stride. A smaller stride increases the number of features extracted but is also more computationally expensive. Pooling layers of CNNs are used to downsample the data and decrease the computational cost of a network. To downsample the data, each subarray of the data is condensed into a single value using either max or mean pooling. In max pooling, the greatest value in the subarray becomes the new value for the subarray, while in mean pooling, the values of the subarray are averaged to generate the new value. As with the convolution layer, the stride of a pooling layer determines how many indices to shift to perform the next pooling operation on a new subarray [19,20,21,22,23].
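The pooling and stride mechanics described above can likewise be sketched in NumPy (the window size, stride, and input data are illustrative):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Condense each size-by-size subarray into one value, moving by `stride`.

    In max pooling the greatest value in the window survives; in mean
    pooling the window values are averaged."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    reduce_fn = np.max if mode == "max" else np.mean
    for r in range(h):
        for c in range(w):
            window = x[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = reduce_fn(window)
    return out

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 3],
              [4, 6, 5, 7]], dtype=float)
print(pool2d(x, mode="max"))   # [[7. 8.] [9. 7.]]
print(pool2d(x, mode="mean"))  # [[4. 5.] [5.25 4.]]
```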
18.2 Research Perspectives
This study focuses on a computer vision-based approach leveraging machine learning to identify the dynamic properties of engineering structures. In particular, the goal of this paper is to formulate and validate a noncontact, image-based SHM framework for engineering structures using a deep learning strategy for modal frequency identification. The outcome is a technique that enables an automated, noninvasive, image-based approach to SHM, a strategy that would allow true in situ assessment of engineering structures.
As one of the latest advancements in deep learning, attention mechanisms have been widely used in machine translation, image captioning, and many other applications. In learning-based video analytics, attention mechanisms also show great power, either by temporally focusing on different frames of a video or by spatially focusing on different parts of a frame. In this work, an algorithm that acquires spatiotemporal attention from images to detect motion is proposed, and the decomposition filter is learned directly from examples using a deep learning-based strategy named DeepDyn. The proposed approach not only optimizes the filter design for video processing in a fully automated way, without human intervention, but is also capable of generalizing and transferring the learned information to more complex structures. For the structure vibration videos demonstrated in this work, frequency is one of the features that the DeepDyn model can identify. After experiments with various architectures and hyperparameters, the final model consists of a GoogLeNet-like network with two Inception modules for the CNN part, spatio-feature channel attention (SFCA) blocks to attend to key spatial filters, ConvLSTM layers with 32 hidden units each, and a temporal attention layer to focus on key frames. At the end of the ConvLSTM layers, temporal feature attention (TFA) learns to focus on crucial temporal features belonging to different frames. Overall, this network architecture helps to learn an objective function for data with spatiotemporal correlations (Fig. 18.2).
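The internals of the SFCA block are not detailed in this chapter; one plausible reading of channel attention is a squeeze-and-excitation style gate, sketched below in NumPy. The weight matrices `W1`, `W2`, the reduction ratio `r`, and the random feature maps are all assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feats, W1, W2):
    """Squeeze-and-excitation style attention over channels.

    feats: (C, H, W) feature maps. Global-average-pool each channel
    ("squeeze"), pass the pooled vector through a small two-layer
    bottleneck ("excite"), then rescale each channel by its learned
    importance weight in (0, 1)."""
    squeezed = feats.mean(axis=(1, 2))        # (C,) one scalar per channel
    hidden = np.maximum(0.0, W1 @ squeezed)   # ReLU bottleneck
    weights = sigmoid(W2 @ hidden)            # (C,) per-channel gates
    return feats * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feats = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1   # illustrative random weights
W2 = rng.standard_normal((C, C // r)) * 0.1
attended, weights = channel_attention(feats, W1, W2)
print(attended.shape, weights.min() > 0.0, weights.max() < 1.0)  # (8, 4, 4) True True
```

In training, `W1` and `W2` would be learned jointly with the rest of the network so that informative filter channels are amplified and noisy ones are suppressed.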
18.3 Performance Matrix
18.3.1 Evaluation Metric and Generalizability
The proposed deep learning network shown in Fig. 18.2 is trained on the existing base generic dataset defined in the previous section [24]. Generalizability refers to the ability of a model to perform well on unseen datasets within its training input domain. In this paper, the mean absolute error (MAE) is adopted as the evaluation criterion for the proposed deep learning architecture; it also serves as the loss function for all deep learning architectures during training. The MAE quantifies the absolute error between the predicted and ground-truth values and is expressed by Eq. (18.1):

\( \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \)  (18.1)

where \( \hat{y}_i \) is the predicted value, \( y_i \) is the ground-truth value, and n is the sample size of the dataset. Here the MAE values between the predicted and true natural frequencies for six different beams are used to evaluate model performance. The Adam optimizer is employed to train the networks given its advantages over other stochastic optimization methods. The learning rate is kept at 0.0001, and the total number of training epochs is set to 400. The validation dataset helps monitor overfitting during training, and the trained network with the lowest validation loss is saved for fine-tuning on a different dataset (prediction/testing). To demonstrate the effectiveness of the different modules, a comparison of validation loss curves is shown in Fig. 18.3a; the attention blocks clearly improve the validation accuracy. Figure 18.3b outlines the MAE of the developed model on the test dataset. The mean MAE ranges between 0.45 and 1.5, which attests to the strong performance of the model, and the predictions closely trace the ground truth. Among the different network configurations, DeepDyn achieves the best prediction results, with a mean MAE of 0.4; the worst performance is from the plain CNN, with a mean MAE of 70.
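The MAE criterion of Eq. (18.1) is straightforward to compute; a minimal sketch, with illustrative (not measured) frequency values for six beams:

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error between predicted and ground-truth values, Eq. (18.1)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean(np.abs(y_pred - y_true))

# Hypothetical predicted vs. ground-truth natural frequencies (Hz) for six beams
y_true = [12.0, 33.5, 47.1, 61.8, 75.2, 90.4]
y_pred = [12.4, 33.0, 47.9, 61.2, 75.9, 89.8]
print(mae(y_pred, y_true))  # ≈ 0.6
```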
18.3.2 Extrapolability and Transfer Learning
After testing on the generic dataset available in [24], the extrapolability of the trained network is investigated using videos from other lab experiments (i.e., with different vibrating objects). Here, videos recording the vibration of a turbine structure [25] are chosen for the validation study. Note that inference with the trained network to extract modal frequencies from these new testing videos requires transfer learning for fine-tuning. To verify the extrapolability of the proposed model, a strategy is executed to show the capability of DeepDyn in generalizing to datasets other than the one used in training. The video data of a wind turbine [25, 26] are used to verify the extrapolability of the trained network for extracting resonant frequencies. The strategy consists of the following steps. First, the model is trained on the generic dataset [24] for the first three modal frequencies. Second, transfer learning is conducted to fine-tune the model on the new dataset for the first nine modal frequencies. Third, the fine-tuned model is tested on the first generic dataset to obtain the rest of the nine sets of frequencies. It can be observed that the trained network successfully captures the modal properties and, in general, predicts them close to the ground truth, as presented in Fig. 18.4. This illustrative case, together with the previous cases, demonstrates that the trained DeepDyn network is transferable and generalizable to the extraction of modal frequencies from unseen videos.
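The train-then-fine-tune workflow above can be illustrated structurally: pretrained feature weights are copied into a new model, optionally frozen, while the regression head is re-initialized for a different number of output frequencies (three on the base dataset, nine on the new one). The dictionary-based "model", all names, and all shapes here are hypothetical stand-ins for the actual DeepDyn pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained parameters from the base (generic) dataset:
# learned feature-extraction kernels plus a small regression head mapping
# 16 pooled features to the first three modal frequencies.
pretrained = {
    "conv_kernels": rng.standard_normal((16, 3, 3)),  # 16 learned 3x3 filters
    "head": rng.standard_normal((3, 16)),             # 3 output frequencies
}

def build_finetune_model(pretrained, n_new_outputs, freeze_features=True):
    """Initialize a target-task model from pretrained weights.

    The feature kernels are copied over (and optionally frozen), while the
    regression head is re-initialized to match the new number of outputs."""
    model = {
        "conv_kernels": pretrained["conv_kernels"].copy(),
        "head": rng.standard_normal((n_new_outputs, 16)) * 0.01,
    }
    trainable = ["head"] if freeze_features else ["conv_kernels", "head"]
    return model, trainable

# Fine-tune for nine modal frequencies, updating only the new head.
model, trainable = build_finetune_model(pretrained, n_new_outputs=9)
print(model["head"].shape, trainable)  # (9, 16) ['head']
```

Freezing the feature layers lets the limited new dataset be spent entirely on the small head, which is the underfitting-avoidance argument made in the abstract.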
18.4 Conclusion
This paper proposes a novel attention-based neural network model, named DeepDyn, to fully mine the spatiotemporal information of images. The natural frequencies of structures are extracted from dissimilar structures and extrapolated to higher modes even when training data are not available. In this model, CNN and attention mechanisms are used to extract the most relevant visual spatiotemporal features from sequences of images, and the attention mechanism is further combined with ConvLSTM to mine the correlations among the spatial information of the visual features. Experimental results indicate that the proposed method outperforms most existing image-based modal analysis methods and confirm the effectiveness and advantages of video-based system dynamics identification. Compared with a general CNN-RNN model, the proposed architecture can effectively predict modal information from sequences of images on the dataset. Overall, the trained networks have the potential to enable real-time monitoring of structural vibration. Future work will focus on the application and validation of this technique on in situ structures, while enriching the training datasets to enhance the inference ability of the trained network.
References
Sohn, H., et al.: A review of structural health monitoring literature: 1996–2001, p. 1. Los Alamos National Laboratory, Washington, DC (2003)
Chang, P.C., Flatau, A., Liu, S.: Health monitoring of civil infrastructure. Struct. Health Monit. 2(3), 257–267 (2003)
Worden, K., et al.: The fundamental axioms of structural health monitoring. Proc. R. Soc. A Math. Phys. Eng. Sci. 463(2082), 1639–1664 (2007)
Webb, G., Vardanega, P.J., Middleton, C.R.: Categories of SHM deployments: Technologies and capabilities. J. Bridge Eng. 20(11), 04014118 (2015)
Montalvao, D., Maia, N.M.M., Ribeiro, A.M.R.: A review of vibration-based structural health monitoring with special emphasis on composite materials. Shock Vib. Dig. 38(4), 295–324 (2006)
Schubel, P., et al.: Review of structural health and cure monitoring techniques for large wind turbine blades. Renew. Energy. 51, 113–123 (2013)
Goyal, D., Pabla, B.: The vibration monitoring methods and signal processing techniques for structural health monitoring: a review. Arch. Comput. Method Eng. 23(4), 585–594 (2016)
Dizaji, M., Alipour, M., Harris, D.: Subsurface damage detection and structural health monitoring using digital image correlation and topology optimization. Eng. Struct. 230, 111712 (2021)
Dizaji, M.S., Alipour, M., Harris, D.: Leveraging full-field measurement from 3D digital image correlation for structural identification. Exp. Mech. 58(7), 1049–1066 (2018)
Dizaji, M.S., Alipour, M., Harris, D.: Image-based tomography of structures to detect internal abnormalities using inverse approach. Exp. Tech., 1–16 (2021)
Dizaji, M.S., et al.: En“vision”ing a novel approach for structural health monitoring – a model for full-field structural identification using 3D digital image correlation. In The 8th international conference on structural health monitoring of intelligent infrastructure. Brisbane, Australia (2017)
Dizaji, M.S., et al.: Full-field non-destructive image-based diagnostics of a structure using 3D digital image correlation and laser scanner techniques. J. Civil Struct. Health Monit. 11(5), 1415–1428 (2021)
Schmidt, T., Tyson, J., Galanulis, K.: Full-field dynamic displacement and strain measurement-specific examples using advanced 3D image correlation. Photogrammetry: Part II. Exp. Tech. 27(4), 22–26 (2003)
Kohut, P., et al.: Noncontact measurements for vibration-based SHM and NDE. Hindawi (2019)
Stanbridge, A.B., Martarelli, M., Ewins, D.J.: Measuring area vibration mode shapes with a continuous-scan LDV. Measurement. 35(2), 181–189 (2004)
Huntley, J.M.: Automated fringe pattern analysis in experimental mechanics: a review. J. Strain Anal. Eng. Des. 33(2), 105–125 (1998)
Southwick, M., Mao, Z., Niezrecki, C.: A complex convolution kernel-based optical displacement sensor. IEEE Sens. J. 20(17), 9753–9762 (2020)
Wadhwa, N., et al.: Phase-based video motion processing. ACM Trans. Gr (TOG). 32(4), 1–10 (2013)
Koushik, J.: Understanding convolutional neural networks. arXiv preprint arXiv:1605.09081 (2016)
Ketkar, N.: Convolutional neural networks. In: Deep learning with Python, pp. 63–78. Springer (2017)
O’Shea, K., Nash, R.: An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458 (2015)
Vedaldi, A., Lenc, K.: Matconvnet: convolutional neural networks for matlab. In: Proceedings of the 23rd ACM international conference on Multimedia (2015)
Dizaji, M.S., Harris, D.K.: 3D InspectionNet: a deep 3D convolutional neural networks based approach for 3D defect detection on concrete columns. In: Nondestructive characterization and monitoring of advanced materials, Aerospace, civil infrastructure, and transportation XIII. International Society for Optics and Photonics (2019)
Yang, R., et al.: CNN-LSTM deep learning architecture for computer vision-based modal frequency detection. Mech. Syst. Signal Process. 144, 106885 (2020)
Poozesh, P., et al.: Feasibility of extracting operating shapes using phase-based motion magnification technique and stereo-photogrammetry. J. Sound Vib. 407, 350–366 (2017)
Sarrafi, A., et al.: Vibration-based damage detection in wind turbine blades using Phase-based Motion Estimation and motion magnification. J. Sound Vib. 421, 300–318 (2018)
Acknowledgments
Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-17-2-0211. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the US Government. The US Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation herein.
Copyright information
© 2023 The Society for Experimental Mechanics, Inc.
Dizaji, M.S., Mao, Z. (2023). A Hybrid-Attention-LSTM-Based Deep Convolutional Neural Network to Extract Modal Frequencies from Limited Data Using Transfer Learning. In: Di Maio, D., Baqersad, J. (eds) Rotating Machinery, Optical Methods & Scanning LDV Methods, Volume 6. Conference Proceedings of the Society for Experimental Mechanics Series. Springer, Cham. https://doi.org/10.1007/978-3-031-04098-6_18
Print ISBN: 978-3-031-04097-9
Online ISBN: 978-3-031-04098-6