1 Introduction

Structural Health Monitoring (SHM) aims to provide valuable information for assessing structural integrity and making maintenance decisions by measuring the structural response [1]. Among the various structural indices used in SHM, the displacement response plays a crucial role [2]. Monitoring the dynamic displacements of structures under different load types offers valuable insights into their condition and behavior. These dynamic displacements allow important structural properties to be calculated, such as bearing capacity, deflection, deformation, load distribution, and modal parameters. Furthermore, they can be converted into physical indicators for assessing structural safety [3]. However, traditional displacement measurement methods, such as Laser Displacement Sensors (LDS) and wireless accelerometers, which involve placing a limited number of sensors on the surface of a structure, have limitations. They are cumbersome to install, expensive to maintain, and provide measurements only at discrete points, limiting spatial resolution [4].

In recent years, vision-based displacement sensing technology has emerged as a promising alternative [5]. This technology first tracks the movement trajectory of the measured target in video and then determines the dynamic displacement of the structure by analyzing the positional relationship between the camera and the structure [6]. Vision-based displacement sensing technology offers advantages such as long-distance capability, non-contact operation, and wide range coverage [7,8,9]. It has garnered significant interest from researchers and engineers in the field of SHM and has been further developed for modal identification [10, 11], model updating [12], damage detection [13, 14], load identification [15, 16], and cable force estimation [17], among other applications.

The core of vision-based displacement sensing technology lies in target tracking algorithms. However, commonly used algorithms have certain limitations. Optical flow methods [18,19,20,21,22] and phase-based motion magnification methods [23, 24] are effective for minor displacement changes but may fail with large target displacements and exhibit limited robustness. Optical flow methods are sensitive to illumination changes and prone to accumulating errors, while phase-based motion magnification methods are prone to noise from non-measurement target movements. Feature point matching methods [25,26,27] can track larger displacements but require adjusting multiple parameters and thresholds. Among these, correlation-based template matching methods [28,29,30,31] stand out due to their robustness, versatility, and minimal user intervention requirement. However, this method typically necessitates mounting targets on the structure and can only extract displacements with integer-pixel accuracy [32]. Several scholars have proposed enhanced techniques to address these limitations. Feng & Feng [33] employ upsampling through the Fourier transform, which incurs a substantial computational burden without significant accuracy improvement. Pan et al. [34] significantly enhance measurement accuracy using the Inverse-Compositional Gauss–Newton (IC-GN) nonlinear optimization algorithm, yet the iterative and interpolation procedures involved result in high computational costs. Zhang et al. [35] introduce the Modified Taylor Approximation-based sub-pixel refinement (MTA) algorithm as an additional step after correlation-based matching. This algorithm demonstrates excellent computational efficiency and accuracy, but it remains sensitive to changes in illumination or noise. Existing correlation-based template matching approaches thus still lack the desired robustness and efficiency.

In addition, despite the attention received by vision-based displacement sensing technologies in SHM, their practical applications are still in their infancy. Most research has primarily focused on specific scenarios, such as measuring the dynamic displacement of small-scale structures or the quasi-static displacement of localized areas of larger structures [36]. However, for comprehensive SHM, it is crucial to extend the application of monitoring dynamic displacements to full-scale structures, particularly for modal analysis of slender bridge structures. Before vision-based displacement sensors can fully replace traditional sensors in the SHM field, it is essential to conduct further research that explores their application on larger full-scale structural targets and in more challenging environmental conditions.

The study aims to address these gaps through the following objectives:

  1. Introducing a Sparse Bayesian Learning-based (SBL) algorithm to enhance the robustness, accuracy, and computational efficiency of target tracking, and developing a robust and versatile Vision-based Dynamic Displacement Monitoring System (VDDMS) capable of monitoring displacements across a variety of application scenarios.

  2. Verifying the effectiveness of the proposed algorithm and VDDMS through a specially designed indoor test and an outdoor shear wall shaking table test.

  3. Conducting a shaking table test on a large-scale (1:40) steel arch bridge model. Using a single entry-level consumer camera and targeting the natural texture features of the structure’s surface, VDDMS monitors the displacements and frequencies of the bridge under different seismic excitations and lighting conditions.

  4. Examining the effect of initial template selection on the accuracy of VDDMS and proposing two fast and convenient error assessment schemes suitable for field applications.

The organization of this paper is as follows: Sect. 2 explains the theoretical framework of the VDDMS, including the introduction of the SBL algorithm. In Sect. 3, two experiments are conducted to verify the validity of the proposed algorithm and system. Section 4 encompasses the shaking table test conducted on the bridge. Finally, Sect. 5 summarizes the conclusions drawn from this study and outlines potential directions for future work.

2 Theoretical framework

The robust and versatile Vision-based Dynamic Displacement Monitoring System (VDDMS) framework is composed of three phases, as depicted in Fig. 1. In phase 1, a camera calibration process is performed to acquire camera distortion parameters and a projection matrix that includes both camera intrinsic and extrinsic parameters. Phase 2 involves tracking selected targets using the proposed SBL algorithm. In phase 3, the physical displacement of the targets in world coordinates is calculated. The VDDMS enables efficient and robust monitoring of dynamic displacements at multiple points. This approach allows for accurate and reliable displacement measurements in various scenarios, providing a practical and accessible solution for monitoring dynamic movements.

Fig. 1 Framework of the Vision-based Dynamic Displacement Monitoring System (VDDMS)

2.1 Camera calibration

  (1) Full camera calibration process

    The camera calibration process establishes the projection relationship between three-dimensional (3D) world coordinates and two-dimensional (2D) image coordinates. The transformation from image coordinates \({\mathbf{x}} = \left( {x,y} \right)\) to world coordinates \({\mathbf{X}} = \left( {X,Y,Z} \right)\) is represented by Eq. (1)

    $$s{\mathbf{x}} = {\mathbf{K}}[{\mathbf{R}}|{\mathbf{t}}]{\mathbf{X}} \Leftrightarrow s\left[ {\begin{array}{*{20}c} x \\ y \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {f_{x} } & \gamma & {c_{x} } \\ 0 & {f_{y} } & {c_{y} } \\ 0 & 0 & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {r_{11} } & {r_{12} } & {r_{13} } & {t_{1} } \\ {r_{21} } & {r_{22} } & {r_{23} } & {t_{2} } \\ {r_{31} } & {r_{32} } & {r_{33} } & {t_{3} } \\ \end{array} } \right]\left[ \begin{gathered} X \hfill \\ Y \hfill \\ Z \hfill \\ 1 \hfill \\ \end{gathered} \right]$$
    (1)

    where \(s\) is the scale factor, \({\mathbf{K}}\) is the camera intrinsic matrix related to the lens, and \(\left[ {{\mathbf{R|t}}} \right]\) is the camera extrinsic matrix related to the relative position between the camera and the measurement object.

    The intrinsic matrix is unchanged so long as the lens focal length does not change. However, the extrinsic matrix must be recalibrated whenever the camera position is changed. Therefore, this study proposes to divide camera calibration into two steps. The intrinsic camera matrix and distortion parameters are estimated using a checkerboard method [37]. The extrinsic camera matrix is obtained through the Perspective-n-point method [38] using 2D-to-3D point correspondences.
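    As a minimal sketch of this two-step procedure (the board geometry, file paths, and control points below are illustrative placeholders, not values from this study), the intrinsic matrix and distortion parameters can be estimated with OpenCV's checkerboard routines, and the extrinsic matrix with cv2.solvePnP:

```python
import glob

import cv2
import numpy as np

# Step 1: intrinsic matrix K and distortion parameters from checkerboard images [37].
# Assumed board: 9 x 6 inner corners with 25 mm squares; adjust to the real board.
pattern, square = (9, 6), 25.0
board = np.zeros((pattern[0] * pattern[1], 3), np.float32)
board[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts = [], []
for path in glob.glob("checkerboard/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(board)
        img_pts.append(corners)
_, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)

# Step 2: extrinsic matrix [R|t] via the Perspective-n-point method [38],
# using field-measured 2D-to-3D control point correspondences (placeholders here).
world_pts = np.array([[0, 0, 0], [1.0, 0, 0], [1.0, 0.5, 0], [0, 0.5, 0]], np.float32)
pixel_pts = np.array([[412, 300], [950, 310], [940, 590], [405, 580]], np.float32)
_, rvec, tvec = cv2.solvePnP(world_pts, pixel_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation; [R | tvec] is the extrinsic matrix
```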

  (2) Simplified camera calibration process

    In scenarios where the full camera calibration process poses challenges, two simplified calibration procedures can be used. The first is the planar homography matrix method [38], which simplifies Eq. (1) using the planar homography matrix \({\mathbf{H}}\), as shown in Eq. (2). The Perspective-n-point method can be applied to solve this equation.

    $$s{\mathbf{x}} = {\mathbf{HX}} \Leftrightarrow \left( \begin{gathered} x \hfill \\ y \hfill \\ 1 \hfill \\ \end{gathered} \right) = \left[ {\begin{array}{*{20}c} {h_{11} } & {h_{12} } & {h_{13} } \\ {h_{21} } & {h_{22} } & {h_{23} } \\ {h_{31} } & {h_{32} } & {h_{33} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} X \\ Y \\ 1 \\ \end{array} } \right]$$
    (2)

The second is the scale factor method, which is the most frequently used method in the literature [39]. It estimates the calibration parameter using the physical size \(D\) of a selected object and its corresponding length \(d\) in pixels in the image plane, as represented by the following equation.

$$s = \frac{D}{d}$$
(3)

The scale factor method, while simple, has several limitations [40]. The choice of camera calibration process depends on the specific characteristics of the camera, lens, and motion involved. Some suggestions for selecting the calibration process are as follows: (1) For scenarios where the measured structure exhibits one-dimensional motion and the camera is positioned perpendicular to the movement plane, the scale factor method is a suitable choice. However, when working with targets at different positions, recalibration might be necessary. (2) When measuring a structure that moves on a 2D plane and the lens distortion is small, the homography matrix method is suitable; the camera tilt angle does not affect the measurement results. (3) In scenarios requiring 3D displacement measurement, a large field of view, or involving significant lens distortion, a full camera calibration process is necessary [41]. This comprehensive calibration accounts for lens distortion and ensures accurate and reliable measurements for each target in complex environments.
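For illustration, the sketch below (coordinate values are placeholders) implements the two simplified options: the planar homography method of Eq. (2) via cv2.findHomography, and the scale factor method of Eq. (3):

```python
import cv2
import numpy as np

# Planar homography method: four or more coplanar world <-> image correspondences.
world_xy = np.array([[0, 0], [1.0, 0], [1.0, 0.5], [0, 0.5]], np.float32)  # meters
image_xy = np.array([[412, 300], [950, 310], [940, 590], [405, 580]], np.float32)
H, _ = cv2.findHomography(world_xy, image_xy)  # s*x = H*X, Eq. (2)

def image_to_world(pt, H):
    """Invert Eq. (2): map an image point back to planar world coordinates."""
    v = np.linalg.inv(H) @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]

# Scale factor method, Eq. (3): known physical size D over its length d in pixels.
D, d = 69.9, 145.0           # e.g. target size in mm and its image size in pixels
s = D / d                    # mm per pixel
displacement_mm = s * 12.5   # convert a 12.5-pixel image motion to millimeters
```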

By considering these suggestions and selecting the appropriate camera calibration process, the VDDMS can adapt to different scenarios and provide reliable and precise displacement measurements for dynamic movements. In the experimental part of this study, the three camera calibration processes are used separately.

2.2 The SBL algorithm

2.2.1 Initial integer-pixel displacement

The framework of the Sparse Bayesian Learning-based (SBL) target tracking algorithm, shown in Fig. 2, comprises two main steps: initial integer-pixel displacement estimation and further sub-pixel refinement. In the initial integer-pixel displacement estimation, correlation-based template matching is used. The selected template is slid over another image, and the best match is found by calculating the correlation between the template and the overlapping region during the sliding process. In this paper, the zero-mean normalized cross-correlation coefficient (ZNCC) is used to quantify the correlation strength. Compared with other correlation coefficients, it provides more accurate and reliable results and is insensitive to offsets and scale changes in the intensity of the target area [42]. The ZNCC is calculated as follows.

$$ZNCC\left( {u,v} \right) = \frac{{\sum\nolimits_{{x = x_{1} }}^{{x_{2} }} {\sum\nolimits_{{y = y_{1} }}^{{y_{2} }} {\left( {T\left( {x,y} \right) - \overline{T} } \right)\left( {I^{i} \left( {x + u,y + v} \right) - \overline{{I^{i} }} } \right)} } }}{{\sqrt {\sum\nolimits_{{x = x_{1} }}^{{x_{2} }} {\sum\nolimits_{{y = y_{1} }}^{{y_{2} }} {\left( {T\left( {x,y} \right) - \overline{T} } \right)^{2} } } } \sqrt {\sum\nolimits_{{x = x_{1} }}^{{x_{2} }} {\sum\nolimits_{{y = y_{1} }}^{{y_{2} }} {\left( {I^{i} \left( {x + u,y + v} \right) - \overline{{I^{i} }} } \right)^{2} } } } }}$$
(4)

where \(T\left( {x,y} \right)\) and \(I^{i} \left( {x,y} \right)\) represent the grayscale intensity of the first frame and the i-th frame, respectively. The \(u\) and \(v\) values denote the integer-pixel displacement change. \(\left( {x_{1} ,y_{1} } \right)\) and \(\left( {x_{2} ,y_{2} } \right)\) are the coordinates of the top-left and bottom-right corners of the template in the first frame. \(\overline{T}\) and \(\overline{{I^{i} }}\) denote the mean intensity values of the template and the overlapping region, respectively: \(\overline{T} = \frac{1}{{\mathcal{A}}}\sum\nolimits_{{x = x_{1} }}^{{x_{2} }} {\sum\nolimits_{{y = y_{1} }}^{{y_{2} }} {T\left( {x,y} \right)} }\) and \(\overline{{I^{i} }} = \frac{1}{{\mathcal{A}}}\sum\nolimits_{{x = x_{1} }}^{{x_{2} }} {\sum\nolimits_{{y = y_{1} }}^{{y_{2} }} {I^{i} \left( {x + u,y + v} \right)} }\), where \({\mathcal{A}}\) represents the area of the template.
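In OpenCV, Eq. (4) corresponds to the TM_CCOEFF_NORMED mode of cv2.matchTemplate, so the integer-pixel search can be sketched as follows (grayscale uint8 arrays assumed):

```python
import cv2

def integer_pixel_match(frame_gray, template_gray):
    """Slide the template over the frame and return the best match and peak ZNCC.

    TM_CCOEFF_NORMED implements the zero-mean normalized cross-correlation of
    Eq. (4); the location of the peak gives the integer-pixel displacement.
    """
    score = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, zncc_max, _, best_xy = cv2.minMaxLoc(score)
    return best_xy, zncc_max  # best_xy is the top-left corner of the best match
```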

Fig. 2 Framework of the Sparse Bayesian Learning-based target tracking (SBL) algorithm

The location where the ZNCC reaches its maximum indicates the best match. Since sliding the template over the entire image can be time-consuming, an adaptive matching region algorithm is proposed in this paper to improve efficiency. The algorithm performs the template matching process on a local region instead of the entire image. The center of the sliding matching region for the template in each frame corresponds to the center of the best match in the previous frame. This region is larger than the template, with its size denoted as \(\lambda \cdot {\mathcal{A}}\). If the maximum ZNCC of the current frame is less than 0.8, the value of \(\lambda\) is increased and the template matching process repeated, until the threshold is met or the local region expands to the image boundary. This adaptive matching region technique improves the efficiency of the matching process while reducing the probability of false matches.
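A sketch of this adaptive matching region procedure is given below; the initial \(\lambda\) and the growth factor are illustrative choices (the 0.8 threshold is the one used in this paper), and the previous match center is assumed to lie inside the image:

```python
import cv2

def adaptive_region_match(frame_gray, template_gray, prev_center,
                          lam=2.0, zncc_threshold=0.8, growth=1.5):
    """Match within a window of size lam * template area centered on the
    previous best match, enlarging lam until the peak ZNCC exceeds the
    threshold or the window covers the whole image."""
    h, w = template_gray.shape
    H, W = frame_gray.shape
    cx, cy = prev_center
    while True:
        half_h, half_w = int(lam * h) // 2, int(lam * w) // 2
        y0, y1 = max(cy - half_h, 0), min(cy + half_h, H)
        x0, x1 = max(cx - half_w, 0), min(cx + half_w, W)
        score = cv2.matchTemplate(frame_gray[y0:y1, x0:x1], template_gray,
                                  cv2.TM_CCOEFF_NORMED)
        _, zncc_max, _, loc = cv2.minMaxLoc(score)
        covers_image = (x0 == 0 and y0 == 0 and x1 == W and y1 == H)
        if zncc_max >= zncc_threshold or covers_image:
            # top-left corner of the best match in full-image coordinates
            return (x0 + loc[0], y0 + loc[1]), zncc_max
        lam *= growth  # expand the search region and retry
```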

2.2.2 Refined sub-pixel displacement

The correlation-based template matching can only estimate the pixel-level displacement changes. Further refinement is required to obtain more accurate and reliable sub-pixel displacement changes. The relationship between the intensity of a physical point in the template of the first frame and the intensity of the corresponding point in the best matching region of the i-th frame can be expressed as:

$$T\left( {x,y} \right) = I^{i} \left( {x + u + \Delta u,y + v + \Delta v} \right)$$
(5)

where \(u\) and \(v\) are the initial integer-pixel displacement components, and \(\Delta u\) and \(\Delta v\) are the sub-pixel displacement components.

To account for grayscale changes caused by illumination variations or overexposure/underexposure, a nonlinear brightness variation model is introduced as follows:

$${\text{U}} \left( {{{\varvec{\uptheta}}}^{i} ;T\left( {x,y} \right)} \right) = \theta_{1}^{i} \cdot T\left( {x,y} \right)^{2} + \theta_{2}^{i} \cdot T\left( {x,y} \right) + \theta_{3}^{i}$$
(6)

where \({{\varvec{\uptheta}}}^{i}\) is a parameter vector that describes the grayscale transformation between the initial template and the best matching region of the i-th frame. More complex nonlinear brightness variation models can be formulated by modifying this equation.

Substituting Eq. (6) into Eq. (5) and taking the first-order Taylor expansion of the right-hand side at \(\left( {x + u,y + v} \right)\) yields

$$\theta_{1}^{i} \cdot T\left( {x,y} \right)^{2} + \theta_{2}^{i} \cdot T\left( {x,y} \right) + \theta_{3}^{i} - \Delta u \cdot I_{x + u}^{i} - \Delta v \cdot I_{y + v}^{i} = I^{i} \left( {x + u,y + v} \right)$$
(7)

where \(I_{x + u}^{i}\) and \(I_{y + v}^{i}\) are the spatial gradients of the i-th frame at \(\left( {x + u,y + v} \right)\), calculated using the gray gradient algorithm based on the Barron operator [43].

A sub-pixel displacement regression reference model is established using the sparse Bayesian learning scheme [44, 45]. It assumes that the grayscale change of each pixel point within the template is consistent. Let n represent the index of the pixel points, ranging from n = 1 to N, where N is the total number of pixels in the template. The grayscale data on both sides of Eq. (7) are represented as input samples \({\mathbf{x}}_{n}\) and target values \(t_{n}\), respectively. The training dataset \({\mathcal{D}} = \left\{ {{\mathbf{x}}_{n} ,t_{n} } \right\}_{n = 1}^{N}\) is then used in the sparse Bayesian learning scheme, and noise \(\varepsilon_{n}\) is introduced:

$$t_{n} = y\left( {{\mathbf{x}}_{n} ,{\mathbf{w}}} \right) + \varepsilon_{n} = w_{0} + \sum\limits_{j = 1}^{M - 1} {w_{j} \cdot \phi_{j} \left( {{\mathbf{x}}_{n} } \right)} + \varepsilon_{n}$$
(8)

where \(y\left( {{\mathbf{x}}_{n} ,{\mathbf{w}}} \right)\) represents the objective function, \({\mathbf{w}} = \left[ {w_{0} ,w_{1} ,...,w_{M - 1} } \right]\) is an unknown parameter vector in which each \(w_{j}\) is either a sub-pixel displacement or an illumination change parameter, and \(\phi_{j} \left( {{\mathbf{x}}_{n} } \right)\) is the basis function.

Assuming \(\varepsilon_{n}\) to be a Gaussian error vector in the sparse Bayesian framework, it can be expressed as \(\varepsilon_{n} \sim {\mathcal{N}}\left( {0,\sigma^{2} } \right)\). The likelihood of the complete data set can be written as

$$p\left( {{\mathbf{t}}|{\mathbf{x}},{\mathbf{w}},\sigma^{2} } \right) = \left( {2\pi \sigma^{2} } \right)^{ - N/2} \exp \left\{ { - \frac{1}{{2\sigma^{2} }}\left\| {{\mathbf{t}} - {{\varvec{\Phi}}}\left( {\mathbf{x}} \right){\mathbf{w}}} \right\|_{2}^{2} } \right\}$$
(9)

where \({{\varvec{\Phi}}}\left( {\mathbf{x}} \right)\) is the design matrix, and \({{\varvec{\Phi}}}\left( {\mathbf{x}} \right) = \left[ {\Phi \left( {{\mathbf{x}}_{1} } \right),\Phi \left( {{\mathbf{x}}_{2} } \right),...,\Phi \left( {{\mathbf{x}}_{N} } \right)} \right]^{T},\)\(\Phi \left( {{\mathbf{x}}_{n} } \right) = \left[ {\phi_{0} \left( {{\mathbf{x}}_{n} } \right),\phi_{1} \left( {{\mathbf{x}}_{n} } \right),...,\phi_{M - 1} \left( {{\mathbf{x}}_{n} } \right)} \right]^{T}\).
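To make the construction concrete: with the brightness model of Eq. (6), each pixel of the template contributes one row to the design matrix. Following Eq. (7), one possible layout (an illustrative assumption; the paper does not prescribe a variable ordering) uses the basis \([T^2, T, 1, -I_x, -I_y]\) with weights \([\theta_1, \theta_2, \theta_3, \Delta u, \Delta v]\) and target \(t_n = I^i(x+u, y+v)\):

```python
import numpy as np

def build_design_matrix(T, I_match, Ix, Iy):
    """Assemble Phi and t of Eq. (8) from the linearized model of Eq. (7).

    T       : template intensities from the first frame, shape (h, w)
    I_match : best-matching region of frame i at the integer-pixel offset
    Ix, Iy  : spatial gradients of frame i in that region (Barron operator [43])
    The regression weights then read w = [theta1, theta2, theta3, du, dv].
    """
    T, I_match, Ix, Iy = (a.astype(np.float64).ravel()
                          for a in (T, I_match, Ix, Iy))
    Phi = np.column_stack([T**2, T, np.ones_like(T), -Ix, -Iy])
    return Phi, I_match
```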

In order to avoid overfitting and promote sparsity, the prior distribution of \({\mathbf{w}}\) in Eq. (9) is assumed to be a zero-mean Gaussian distribution, and a separate hyperparameter \(\alpha_{j}\) is introduced for each \(w_{j}\). The prior distribution of \({\mathbf{w}}\) is given by

$$p\left( {{\mathbf{w}}|{{\varvec{\upalpha}}}} \right) = \prod\limits_{j = 0}^{M - 1} {{\mathcal{N}}\left( {w_{j} |0,\alpha_{j}^{ - 1} } \right)}$$
(10)

where \(\alpha_{j}\) represents the precision of the corresponding parameter \(w_{j}\).

Through Bayesian inference, the posterior distribution of the parameter \({\mathbf{w}}\) can be derived as follows [44]:

$$p\left( {{\mathbf{w}}|{\mathbf{t}},{\mathbf{x}},{{\varvec{\upalpha}}},\beta } \right) = \left( {2\pi } \right)^{ - M/2} \left| \sum \right|^{{ - \frac{1}{2}}} \exp \left\{ { - \frac{1}{2}\left( {{\mathbf{w}} - {\mathbf{m}}} \right)^{T} {\sum }^{ - 1} \left( {{\mathbf{w}} - {\mathbf{m}}} \right)} \right\}$$
(11)

where the posterior parameter distribution is denoted by \({\mathbf{w}} \sim {\mathcal{N}}\left( {{\mathbf{m}},\sum } \right)\), and \(\beta = \sigma^{ - 2}\). The posterior mean \({\mathbf{m}}\) and covariance \(\sum\) are as follows:

$${\mathbf{m}} = \beta \sum {{\varvec{\Phi}}}^{T} {\mathbf{t}}$$
(12)
$$\sum = \left( {\beta {{\varvec{\Phi}}}^{T} {{\varvec{\Phi}}} + {\mathbf{A}}} \right)^{ - 1}$$
(13)

with \({\mathbf{A}} = diag({{\varvec{\upalpha}}})\).

The values of \({{\varvec{\upalpha}}}\) and \(\beta\) are determined using the maximum likelihood method of the second kind (type-II maximum likelihood) [46]. In this method, the marginal likelihood function, obtained by integrating out the weight vector, is maximized. It can be expressed as:

$$p\left( {{\mathbf{t}}|{\mathbf{x}},{{\varvec{\upalpha}}},\beta } \right) = \int {p\left( {{\mathbf{t}}|{\mathbf{x}},{\mathbf{w}},\beta } \right)} p\left( {{\mathbf{w}}|{{\varvec{\upalpha}}}} \right)d{\mathbf{w}}$$
(14)

Directly maximizing the marginal likelihood function is computationally complex. Therefore, the logarithm of the marginal likelihood function is maximized, which can be expressed as:

$$\ln p\left( {{\mathbf{t}}|{\mathbf{x}},{{\varvec{\upalpha}}},\beta } \right) = - \frac{1}{2}\left\{ {N\ln \left( {2\pi } \right) + \ln \left| {\mathbf{C}} \right| + {\mathbf{t}}^{T} {\mathbf{C}}^{ - 1} {\mathbf{t}}} \right\}$$
(15)

with \({\mathbf{C}} = \beta^{ - 1} {\mathbf{I}} + {\mathbf{\Phi A}}^{ - 1} {{\varvec{\Phi}}}^{T}\).

By setting the derivative of the marginal likelihood function to zero, the following reestimation equations are obtained:

$$\alpha_{j}^{new} = \gamma_{j} /m_{j}^{2}$$
(16)
$$\left( {\beta^{new} } \right)^{ - 1} = \frac{{\left\| {{\mathbf{t}} - {\mathbf{\Phi m}}} \right\|^{2} }}{{N - \Sigma_{j} \gamma_{j} }}$$
(17)

with \(\gamma_{j} = 1 - \alpha_{j} \sum_{jj}\).

In the final step, the optimized parameters \({{\varvec{\upalpha}}}\) and \(\beta\) are substituted into Eq. (12) and Eq. (13). The resulting posterior mean \({\mathbf{m}}\) is considered to be the precise value for the parameter vector \({\mathbf{w}}\), which includes sub-pixel displacement and illumination change parameters.
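The re-estimation loop of Eqs. (12)-(13) and (16)-(17) can be sketched in a few lines of numpy; the initial hyperparameter values, tolerance, and the small constants guarding against division by zero are illustrative:

```python
import numpy as np

def sbl_refine(Phi, t, max_iter=100, tol=1e-6):
    """Sparse Bayesian regression: iterate Eqs. (12)-(13) and (16)-(17).

    Returns the posterior mean m; with the design matrix layout of the
    previous sketch, its last two entries are the sub-pixel displacements.
    """
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0            # step 5: initialize hyperparameters
    m = np.zeros(M)
    for _ in range(max_iter):
        # Eqs. (13) and (12): posterior covariance and mean
        Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
        m_new = beta * Sigma @ Phi.T @ t
        # Eqs. (16) and (17): hyperparameter re-estimation
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m_new**2 + 1e-12)
        beta = (N - gamma.sum()) / (np.linalg.norm(t - Phi @ m_new)**2 + 1e-12)
        converged = np.linalg.norm(m_new - m) < tol
        m = m_new
        if converged:                        # step 8: stop on convergence
            break
    return m
```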

For a more detailed explanation of the proposed SBL algorithm procedure, please consult Fig. 3. The process follows the steps outlined below.

Fig. 3 Flowchart of the proposed SBL algorithm

  1. Choose a template from the initial frame of the video;

  2. Read the current frame of the video, then determine the search region for the current frame based on the position of the template in the previous frame;

  3. Perform correlation-based template matching using Eq. (4);

  4. If \(ZNCC_{\max } > 0.8\), determine the integer-pixel displacements \(u\) and \(v\); otherwise, expand the search region and return to step 3;

  5. Set initial values for \({{\varvec{\upalpha}}}\) and \(\beta\);

  6. Calculate the mean \({\mathbf{m}}\) and covariance \(\sum\) of the posterior distribution using Eq. (12) and Eq. (13);

  7. Update the hyperparameters \({{\varvec{\upalpha}}}\) and \(\beta\) using Eq. (16) and Eq. (17);

  8. Repeat steps 6 and 7 until reaching the maximum number of iterations or convergence;

  9. Extract the sub-pixel displacement from the mean vector \({\mathbf{m}}\).

The implementation of the proposed method, as presented in this study, is carried out using the Python 3.8 programming language in conjunction with the open-source computer vision library OpenCV 4.5.
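A condensed per-frame driver tying together the sketches of the previous subsections (the helper names, template location, and video path are the hypothetical ones introduced above) might look like this:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("test_video.mp4")        # hypothetical input video
_, first = cap.read()
first_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
x0, y0, h, w = 300, 200, 51, 51                 # template chosen in frame 1 (step 1)
template = first_gray[y0:y0 + h, x0:x0 + w]
center = (x0 + w // 2, y0 + h // 2)
track = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # steps 2-4: integer-pixel displacement via adaptive region matching
    (bx, by), zncc = adaptive_region_match(gray, template, center)
    patch = gray[by:by + h, bx:bx + w].astype(np.float64)
    Iy, Ix = np.gradient(patch)                 # stand-in for the Barron gradients [43]
    # steps 5-9: sub-pixel refinement via sparse Bayesian learning
    Phi, t = build_design_matrix(template, patch, Ix, Iy)
    m = sbl_refine(Phi, t)
    du, dv = m[-2], m[-1]
    track.append((bx - x0 + du, by - y0 + dv))  # total image-plane displacement
    center = (bx + w // 2, by + h // 2)
```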

2.3 Physical displacement estimation

When a full camera calibration process is employed, it can provide the essential camera distortion parameters required for correcting lens distortion in displacement measurements. However, directly correcting the raw video would impose a significant computational burden. Therefore, in this study, the proposed approach involves running the target tracking algorithm first, followed by the correction of the coordinate points representing the center position of the target in each frame. The correction process is as follows:

$$\begin{gathered} x_{c} = x\left( {1 + k_{1} r^{2} + k_{2} r^{4} + k_{3} r^{6} } \right) + 2p_{1} xy + p_{2} \left( {r^{2} + 2x^{2} } \right) \hfill \\ y_{c} = y\left( {1 + k_{1} r^{2} + k_{2} r^{4} + k_{3} r^{6} } \right) + 2p_{2} xy + p_{1} \left( {r^{2} + 2y^{2} } \right) \hfill \\ \end{gathered}$$
(18)

where \(\left( {x,y} \right)\) are the image coordinates, \(\left( {x_{c} ,y_{c} } \right)\) are the corrected coordinates, \(r\) is the Euclidean distance from the distorted point to the distortion center of the image, and \(k_{i}\) and \(p_{i}\) are the distortion parameters.
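In practice, this point-wise correction maps directly onto cv2.undistortPoints, which inverts the distortion model for individual coordinates instead of whole frames; a sketch, with K and dist from the calibration step, follows:

```python
import cv2
import numpy as np

def undistort_centers(points_xy, K, dist):
    """Correct tracked target centers for lens distortion (cf. Eq. (18)).

    points_xy : (N, 2) array of raw image coordinates of the target centers.
    Passing P=K returns corrected pixel coordinates rather than normalized ones.
    """
    pts = np.asarray(points_xy, np.float32).reshape(-1, 1, 2)
    return cv2.undistortPoints(pts, K, dist, P=K).reshape(-1, 2)
```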

Next, the image coordinates are transformed into world coordinates based on the camera calibration results. It is important to note that recovering the out-of-plane (Z-axis) coordinates of the measured structure using a single camera is theoretically impossible [47]. Therefore, in this study, the Z value is assumed to be constant. Setting Z to 0 in Eq. (1), the transformation can be rearranged as follows:

$$\left[ \begin{gathered} X \hfill \\ Y \hfill \\ s \hfill \\ \end{gathered} \right] = \left[ {\begin{array}{*{20}c} { - r_{11} } & { - r_{12} } & {x^{\prime}} \\ { - r_{21} } & { - r_{22} } & {y^{\prime}} \\ { - r_{31} } & { - r_{32} } & 1 \\ \end{array} } \right]^{ - 1} \left[ \begin{gathered} t_{1} \hfill \\ t_{2} \hfill \\ t_{3} \hfill \\ \end{gathered} \right]$$
(19)

with \(\left[ {\begin{array}{*{20}c} {x^{\prime}} & {y^{\prime}} & 1 \\ \end{array} } \right]^{T} = {\mathbf{K}}^{ - 1} \left[ {\begin{array}{*{20}c} x & y & 1 \\ \end{array} } \right]^{T}\).

To estimate the physical displacement, the initial template center coordinates are subtracted from the world coordinates of the center of the best match in each frame. This calculation yields the physical displacement change of the measured target, which can be expressed as follows:

$$\left[ {\begin{array}{*{20}c} {dX^{i} } \\ {dY^{i} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {X^{i} } \\ {Y^{i} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {X^{1} } \\ {Y^{1} } \\ \end{array} } \right]$$
(20)
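A compact sketch of Eqs. (19) and (20) is given below, assuming K, R, and t from the calibration step and a hypothetical array centers_xy of corrected best-match centers, one per frame:

```python
import numpy as np

def image_to_world_Z0(pt_xy, K, R, t):
    """Solve Eq. (19): recover (X, Y) on the Z = 0 plane from one image point."""
    xp, yp, _ = np.linalg.inv(K) @ np.array([pt_xy[0], pt_xy[1], 1.0])
    A = np.array([[-R[0, 0], -R[0, 1], xp],
                  [-R[1, 0], -R[1, 1], yp],
                  [-R[2, 0], -R[2, 1], 1.0]])
    X, Y, _s = np.linalg.inv(A) @ np.ravel(t)
    return np.array([X, Y])

# Eq. (20): physical displacement of the target relative to the first frame.
world = np.array([image_to_world_Z0(p, K, R, t) for p in centers_xy])
displacement = world - world[0]
```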

3 Verification

3.1 Indoor experiment

To verify the robustness and computational efficiency advantages of the SBL algorithm, a novel dynamic displacement monitoring experiment was conducted. The experiment involved synthesizing a video of a moving target, which was then played on a liquid crystal display (LCD). By leveraging the precise physical distance between adjacent pixels, the displacement of the moving target on the LCD could be accurately controlled. The VDDMS was employed to measure the physical displacement of the moving target.

The experiment employed an RMMNT27NQ LCD, featuring a resolution of 2560 × 1440 pixels and a pixel pitch of 0.233 mm. The moving target consisted of a logo pattern containing a Schneider code [48], measuring 69.9 × 69.9 mm. The target exhibited horizontal motion at a speed of 4.66 mm/s over a duration of 10 s. As shown in Fig. 4, the experimental setup included a compact entry-level action camera, the DJI Pocket 2, with a resolution of 1920 × 1080 pixels and a frame rate of 60 fps. The camera was positioned perpendicular to the target, and the calibrated scale factor was 0.48 mm/pixel.

Fig. 4 The indoor experimental setup

The experiment comprised four subtests. Subtest (a) served as the baseline, while the others introduced specific variations: subtests (b) and (c) involved gradual changes in the gray intensity of the moving target to simulate varying illumination conditions, and subtest (d) applied Gaussian noise with a standard deviation increasing up to 0.2 to the target during its movement. The variations in the moving target for each subtest are depicted in Fig. 5. All subtests used the same initial template of size 145 × 145 pixels. The MTA algorithm [35], IC-GN algorithm [34], Kanade-Lucas-Tomasi (KLT) algorithm [49], and the SBL algorithm were employed in the four subtests. The MTA and IC-GN algorithms were initialized using the same ZNCC-based integer-pixel displacement search process as the SBL algorithm. All experiments in this study were performed on a computer equipped with an AMD Ryzen 7 5800X @3.80 GHz CPU.

Fig. 5 The variations in the moving target for each subtest

To evaluate the measurement error globally, the Root Mean Square Error (RMSE) was calculated using the following equation:

$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {d_{gt}^{i} - d_{v}^{i} } \right)^{2} } }$$
(21)

where \(n\) is the number of displacement data points, \(d_{gt}^{i}\) is the ground-truth displacement at point \(i\), and \(d_{v}^{i}\) is the corresponding value measured by the vision-based displacement sensor.
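For reference, Eq. (21) is a one-line computation in numpy:

```python
import numpy as np

def rmse(d_gt, d_v):
    """Root mean square error between ground-truth and vision-measured series."""
    d_gt, d_v = np.asarray(d_gt, float), np.asarray(d_v, float)
    return float(np.sqrt(np.mean((d_gt - d_v) ** 2)))
```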

Figure 6 and Table 1 illustrate the error and computational speed of each target tracking algorithm. In subtest (a), the MTA, IC-GN, and SBL algorithms achieved consistent accuracy with an RMSE of 0.05 mm; applying the inverse of the scale factor, this corresponds to 1/10 pixel, demonstrating the effectiveness of these algorithms in this experiment. The KLT algorithm exhibited an increasing error over time, with an RMSE of 0.34 mm. The KLT algorithm is prone to errors when tracking fast targets, and because it calculates displacements between consecutive frames, these errors gradually accumulate over time [27].

Fig. 6 Error and computational speed of each target tracking algorithm

Table 1 RMSE and computation speed of the different algorithms

In subtests (b) and (c), as the gray intensity variation of the moving object increases, the MTA algorithm is more strongly affected, leading to a larger error range. The IC-GN algorithm shows good robustness to small brightness changes, but when significant dimming occurs, its iterative optimization deviates from the correct direction, causing a sudden increase in error. In contrast, the SBL algorithm consistently keeps the error within a controllable range regardless of the intensity variation. In subtest (d), the accuracy of the SBL algorithm decreases slightly but remains at a reasonable level.

Among the four algorithms, the KLT algorithm stands out for its high computational efficiency, processing an average of 42.3 frames per second (fps). On the other hand, due to the interpolation and iterative calculations involved, the IC-GN algorithm has a significantly lower average computation speed of only 0.24 fps. The MTA and SBL algorithms exhibit comparable computation speeds of 16.4 and 15.8 fps, respectively. It is worth noting that the computation speed is influenced by the size of the template used. Opting for a smaller template size can enhance the speed of the SBL algorithm.

In summary, the SBL algorithm demonstrates greater robustness in terms of measurement accuracy across different subtests, while maintaining satisfactory computational efficiency. Its ability to consistently keep the error within a controllable range makes it a promising choice for vision-based displacement sensing technologies.

3.2 Field experiment

In 2018, the University of California, San Diego conducted shake table tests to investigate the lateral response of steel sheet sheathed cold-formed steel framed in-line wall systems. The test details, reports, videos, and data can be found on DesignSafe [50]. Among the test specimens, SGGS-1 from Test Group 1 was a shear-gravity-gravity-shear wall line specimen measuring 4.8 × 2.7 m, as shown in Fig. 7. To measure the wall drift, a string potentiometer was installed on the side face of the beam at the top of the specimen.

Fig. 7 The specimen SGGS-1 of Test Group 1 (shot by the DVR)

The “EQ1” test, which used the amplitude-modulated Canoga Park record component CNP196 of the 1994 Mw = 6.7 Northridge Earthquake, was selected for analysis. The specimen remained within the elastic range during the test. A fixed digital video recorder (DVR) camera placed south of the specimen recorded the dynamic test process. The captured video had a resolution of 1920 × 1080 pixels, a frame rate of 30 fps, and a total of 2008 frames. VDDMS analyzed the video data to monitor the displacement of the specimen.

On top of the specimen, a concrete weight plate was equipped with a 3 × 3 checkerboard, with each grid having a side length of 12 cm [51]. The corner coordinates of the checkerboard were extracted, and the planar homography matrix method was employed to estimate the projective transformation. Two adjacent initial templates, sized 87 × 87 pixels, were selected on top of the specimen. One template contained artificial targets, while the other had natural targets.

Figure 8 illustrates the displacement of the specimen’s top measured by VDDMS and the string potentiometer. The RMSE and computation speed of VDDMS are presented in Table 2, taking the string potentiometer measurements as ground truth. The RMSE of the natural target measured by VDDMS was 0.37 mm, 2.7% higher than that of the artificial target. After applying the inverse projective transformation, the RMSE of the natural target corresponded to 1/12 pixel. For templates of the same size, VDDMS processed both natural and artificial targets at a speed of 39 fps. Hence, the computation speed is independent of the target type within the template.

Fig. 8 Displacement comparison between VDDMS and string potentiometer measurements at the top of the specimen

Table 2 RMSE and computation speed of VDDMS

4 Shaking table test of large-scale bridge

4.1 Experimental setup

A shaking table test of a large-scale (1:40) steel box basket-handle arch bridge model was conducted in February 2022 at the Beijing University of Technology. VDDMS was used to monitor multiple points on the model bridge under various seismic excitations. As shown in Fig. 9, the bridge model features a main span of 7500 mm, two side spans of 1250 mm each, and is supported by 24 stay cables.

Fig. 9 Shaking table testing of the large-scale steel arch bridge model

The shaker system consists of six small shakers, each measuring 1 × 1 m. Table 3 provides details of the seismic excitation input scenarios used in the experiment, including the type of excitation, peak ground acceleration, and direction. Three LDS and accelerometers were installed at specific locations on the bridge model to measure longitudinal displacements. The response of the bridge model to seismic vibration was captured using a Panasonic Lumix DMC-FZ2500, an entry-level consumer camera. The camera, positioned approximately 12 m away from the bridge model, was not precisely adjusted and had a noticeable tilt angle. The captured video had a resolution of 3840 × 2160 pixels and a frame rate of 30 fps, which covers the dominant vibration modes of most civil engineering structures.

Table 3 Seismic excitation cases

A complete camera calibration process was adopted in this experiment. The intrinsic and distortion parameters of the Panasonic Lumix DMC-FZ2500 camera were determined in advance by analyzing checkerboard images captured at different locations and orientations. The camera’s focal length was locked, and the extrinsic parameters were determined using four pairs of 2D-to-3D points. The distances between these control points were obtained from field measurements. A world coordinate system was established, with the X-axis aligned along the bridge span direction and the Y-axis in the vertical direction. Initial templates, sized 51 × 51 pixels, were selected near the LDS measurement points (T1: mid-span of the bridge, T2: top of the arch rib, T3: 1/4 height of the arch rib), as shown in Fig. 10. These templates captured the natural texture features of the structural surface. All pixel points within each template lie on the same plane and share a common motion trajectory.

Fig. 10 The location of the initial templates

4.2 Results of displacement

Figure 11 illustrates the displacement measurements of three targets obtained from both the VDDMS and LDS in case 1. To ensure accurate comparison, the signals from both measurement methods were aligned to a common reference time, and any minor time shifts were corrected using the maximum cross-correlation technique. The figure demonstrates good agreement between the displacement measurements obtained from VDDMS and LDS. This indicates that the VDDMS is capable of accurately capturing the vibration displacements of the targets, comparable to the measurements obtained from the traditional LDS. Figure 12 shows the distribution of the absolute differences between the displacements measured by VDDMS and LDS in cases 1 to 6. Table 4 provides the peak-to-peak (Pk-pk) and RMSE values of the VDDMS displacement measurements, taking the LDS measurements as ground truth.

Fig. 11 Displacement measurements of three targets obtained from VDDMS and Laser Displacement Sensors (LDS) in Case 1

Fig. 12 Distribution of absolute displacement differences measured by VDDMS and LDS from Case 1 to 6

Table 4 Pk-pk and RMSE of displacement measured by the VDDMS

One notable observation from the experiment is related to the Pk-pk displacement of different cases. Case 3 has a smaller Pk-pk displacement compared to Case 1 and Case 2. Similarly, Case 6 exhibits significantly smaller Pk-pk displacement than Case 4 and Case 5. The displacement amplitudes induced by artificial waves are larger than those caused by natural waves.

Another noteworthy observation pertains to measurement accuracy. Despite higher peak ground accelerations in the seismic excitation, resulting in faster and larger vibrations of the structure, the RMSE does not show a significant increase. Interestingly, the normalized RMSE actually decreases noticeably. This indicates that the velocity of the target motion has a limited impact on the measurement accuracy of the VDDMS. It can also be inferred that the primary errors in the VDDMS are fixed systematic errors, likely introduced during camera calibration.

Furthermore, the RMSE of target T3 is generally larger than that of the other two targets, indicating a larger error for targets at the edge of the image. This could be attributed to the greater lens distortion of consumer cameras farther from the image center. Additionally, different initial templates may yield different displacement measurement accuracies.

Across all measurement targets and seismic excitation conditions, the maximum and minimum values of RMSE are 0.64 mm and 0.20 mm, respectively, while the maximum and minimum values of Normalized RMSE are 2.9% and 0.2%. These results demonstrate that the VDDMS accurately monitors multiple targets within a large range on the steel arch bridge model under different seismic excitations, utilizing an entry-level consumer camera and the natural texture features of the structure’s surface. Remarkably, this monitoring system does not require precise camera position adjustments, making it practical and effective for real-world applications.

4.3 Illumination change robustness and computing efficiency

In this section, we evaluate the illumination robustness and computational efficiency of VDDMS, highlighting its advantages. To simulate changing illumination during the movement of the bridge structure, we modify the grayscale values of frames in the video recorded in case 1. Three illumination conditions are considered: (a) no change, (b) brightened at the 2 s mark, and (c) dimmed at the 2 s mark. We measure the displacement of target T1 under these different illumination conditions.

Figure 13 illustrates the difference from the LDS and the computation speed of VDDMS using four target tracking algorithms: MTA, ICGN, SBL, and KLT. The corresponding RMSEs are listed in Table 5. The initial template size for each condition is set to 51 × 51 pixels. Under illumination condition (a), the accuracy of VDDMS using all four algorithms is consistent, yielding an RMSE of 0.22 mm. However, for conditions (b) and (c), the MTA and KLT algorithms exhibit increased errors as the illumination becomes brighter or darker, while the IC-GN and SBL algorithms remain stable. The MTA algorithm calculates the displacement change between the initial frame and subsequent frames, resulting in measurement errors when the illumination differs from the initial state. In contrast, the KLT algorithm computes displacement changes between adjacent frames, introducing errors only when the illumination changes, but these errors accumulate over time. The SBL and IC-GN algorithms exhibit higher illumination robustness.

Fig. 13 Difference from LDS and computation speed of VDDMS using MTA, ICGN, SBL, and KLT algorithms for Target T1

Table 5 RMSE and computation speed of VDDMS using MTA, ICGN, SBL, and KLT algorithms

In terms of computational speed, the IC-GN algorithm operates at a significantly slower pace, with approximately 1.5 fps. In contrast, the SBL algorithm achieves a much faster computation speed of 52.6 fps. Therefore, the proposed SBL algorithm demonstrates better illumination robustness and computational efficiency in the VDDMS system.

4.4 Identify structural dynamic characteristics

We analyzed the displacement spectrum of the model bridge subjected to white noise excitation. It is worth noting that, for frequency identification, the measured displacements do not require conversion into physical displacement units. Figure 14 presents the measurements obtained from VDDMS and accelerometers under white noise excitation, along with the corresponding spectral analysis results. The first-order vibration mode of the model bridge is characterized by longitudinal drift of the beam, while the second-order vibration mode corresponds to transverse bending of the arch. Table 6 provides the frequencies of the model bridge obtained from both the accelerometer and VDDMS measurements. The frequencies measured by VDDMS closely match those obtained from the acceleration spectrum analysis, 0.90 Hz and 7.53 Hz, being lower by only 0.01 Hz and 0.03 Hz, respectively.

Fig. 14 Spectral analysis comparison of VDDMS and accelerometer measurements under white noise excitation

Table 6 Frequency comparison of model bridge obtained by accelerometer and VDDMS

In conclusion, the displacement spectrum analysis demonstrates the reliability of VDDMS in accurately capturing the vibration frequencies of the model bridge under white noise excitation, with close agreement to the acceleration-based results.

4.5 Initial template selection

The selection of the initial template is a crucial step in VDDMS, and this section investigates its influence on measurement accuracy. Five initial templates were chosen near the top of the arch ribs, and their information is presented in Table 7. The distinctiveness of template features was quantified using the Sum of the Square of Subset Intensity Gradients (SSSIG) [52] in the x and y directions, where higher SSSIG values indicate richer texture features within the template. VDDMS measured the displacement changes of the five targets in case 5, and the error distribution is depicted in Fig. 15. The RMSE values for templates F1 and F3 were 0.64 mm and 0.59 mm, respectively, lower than those of templates F2 and F4. Template F5, lacking distinctive texture features, caused VDDMS to lose track of the target during the measurement process.

Table 7 Information on the five initial templates
Fig. 15 Displacement difference distribution of targets F1 to F5 as measured by VDDMS

These findings emphasize the significance of template distinctiveness in influencing measurement accuracy. Templates with higher SSSIG values, indicative of richer texture features, exhibited lower RMSE values. Therefore, it is crucial to carefully select initial templates with distinctive texture features to ensure reliable target tracking throughout the measurement process.

4.6 Error evaluation schemes

The accuracy of vision-based displacement sensing technologies is affected by various factors, including hardware devices, algorithms, and environmental conditions [53]. In practical applications, fast error evaluation is crucial. However, practical applications often lack comparable measurements from LDS, making it challenging to evaluate the reliability of vision-based displacement sensors. Hence, this study proposes two schemes for evaluating the error of vision-based displacement sensors.

The first scheme involves extracting the displacements of measurement targets in the static state of the structure before the test, while the second scheme focuses on extracting the displacements of stationary background targets during the test. Since the actual displacement value of these targets should ideally be zero, any non-zero measurements can be considered as measurement errors. The first error evaluation scheme is designed to assess errors arising from hardware devices and algorithms, making it particularly suitable for short-term displacement measurements. On the other hand, the second scheme aims to evaluate errors attributed to environmental factors and is more applicable for long-term displacement monitoring purposes.

In this study, a video of the bridge in the static state before case 1 was captured, and the displacements of targets T1 to T3 were extracted using the first error evaluation scheme. Additionally, the displacements of three stationary background targets, BJ1 to BJ3, were extracted from the video of case 1. The results of the two error evaluation schemes are presented in Fig. 16 and Table 8. In the first scheme, the maximum RMSE is 1/7 pixel; in the second, it is 1/10 pixel. The measurement accuracy of T3 is lower than that of T1, consistent with the experimental results in the previous section. Furthermore, the second scheme indicates that environmental factors had a minimal impact on measurement error in this experiment, less than 1/10 pixel.

Fig. 16 Results of error evaluation schemes

Table 8 RMSE of error evaluation schemes

5 Conclusions

This research contributes to the advancement of vision-based sensing technologies in SHM. The robustness and versatility of VDDMS in different application scenarios are proven by a series of experiments. The key conclusions derived from this research can be summarized as follows:

  1. The SBL algorithm demonstrates superior robustness in handling illumination changes compared to the KLT and MTA algorithms. It also offers faster computational efficiency than the ICGN algorithm, approaching that of the high-efficiency MTA algorithm. The VDDMS accurately monitors natural targets on large-scale shear walls under outdoor conditions, achieving an RMSE of 0.37 mm, approximately 3% higher than that of artificial targets.

  2. By utilizing an entry-level consumer camera and leveraging the natural texture features of the structure’s surface, VDDMS demonstrates accurate monitoring of multiple targets on bridge models under various seismic excitations, eliminating the need for precise camera position adjustments. The RMSE, in comparison to laser displacement sensors, ranges from 0.2% to 2.9% of the peak-to-peak displacement. Furthermore, VDDMS effectively identifies the multi-order frequencies of the model bridge, which closely align with the results obtained from accelerometers.

  3. The distinctiveness of initial template features is found to be a crucial factor influencing the accuracy of VDDMS. Furthermore, the two proposed error evaluation schemes can quickly assess the reliability of vision-based displacement sensing techniques and can be conveniently applied in field measurements.

These findings provide valuable insights for the future development of vision-based displacement sensing technologies. It is important to note that VDDMS may face challenges when natural texture features lack prominence, a common limitation of vision-based displacement sensing technologies. To overcome this limitation, future research should focus on extracting deeper target features, potentially through advanced deep learning techniques. Furthermore, the development of computer vision-based 3D displacement sensing techniques is necessary to address the challenges associated with 3D measurements in SHM applications.