
13.1 Introduction

Structural Health Monitoring (SHM) [1] has been an active field of research for several decades. It involves monitoring the changes that a structure undergoes during its life span through the adoption of modern sensors, including wireless sensors [2] and wired sensors such as Linear Variable Differential Transformers (LVDTs) [3] and improved versions of strain gauges [4]. However, most state-of-the-art SHM sensors are limited by their contact-based nature, which leaves room for errors in the results. Moreover, in contrast to computer vision-based approaches, they only provide data corresponding to a single measurement point. In recent years, attention has been drawn to the application of computer vision for deformation measurement in structures. In this realm, deformation is captured using pixel locations extracted from camera frames [5] and fused with acceleration data. To address the limitations of measuring deformations with contact-based sensors, some studies [6] employed virtual markers, called imaging key points, as a potential replacement for physical targets. To make machine vision-based techniques more robust, work [7] has also been done to improve the quality of the measured structural response under unfavorable lighting conditions by means of an adaptive region-of-interest image processing algorithm.

Although the studies discussed so far used computer vision-based techniques to capture deformation in a structure, only a small number of features were detected over time and then converted into final deformation or deflection values. Detecting the correct pixel values is not an easy task, since most of these studies rely on virtual markers. A recent development in the machine vision realm is a non-contact virtual fiducial system (AprilTag) that can be detected automatically and provides a robust pixel value for any point of interest in the image frame. Some recent applications of fiducial markers in camera pose estimation are discussed in this work. Li [8] proposed a method that fuses information from RGB-D images to obtain robust camera pose measurements with the help of fiducial markers. The study used RGB-D images captured by an RGB camera combined with a depth sensor and reported an improvement in translation, although the rotation error could not be minimized. The authors in [9] proposed a method of averaging the rotations captured in the image data and then optimizing the results, which reduced the error in camera positioning; the proposed method was compared with other methods. The researchers in [10] developed an approach to build a fiducial map based on manifold optimization. These fiducial maps were then used for pose estimation, and the accuracy of the results was tested against ground truth from a motion capture system, yielding an RMSE of 3.1 cm. Regarding 3D mapping of fiducial markers, [11] studied a framework called “AprilTag3D”, a setup of at least two tags that do not lie in the same plane but are connected with each other along an axis. In the proposed methodology, the researchers used RGB images and added a third dimension to the marker detector to improve AprilTag pose estimation. The proposed method improved detection and pose estimation, especially in highly reflective environments. Gupta and Hebbalaguppe [12] presented a do-it-yourself wearable device coupled with a webcam that provides 6-DOF per-fingertip tracking in real time; the authors used ViSP, which has a built-in AprilTag module. Abbas et al. [13] proposed a three-step pose estimation approach consisting of (1) trigonometric correction of the yaw angle, (2) use of a custom-made gimbal for further improvement of the yaw angle, and (3) a pose-indexed probabilistic sensor error model of AprilTag obtained through Gaussian process regression of experimental data and validated with particle filter tracking.

Up to this point, applications of robotic vision to pose estimation of different objects have been discussed. Since the focus of this research is to deploy fiducial markers (a computer vision-based technology) effectively in the domain of Civil Engineering (specifically in Structural Engineering), a brief overview of current developments of these computer vision-based techniques in Civil Engineering is presented here. Valença and Carmo [14] proposed a method based on photogrammetry and image processing known as “Photo-Node”. This non-contact method was used to monitor concrete surfaces and to measure the relevant physical parameters of reinforced concrete beam-column joints during experimental testing. The proposed method exploited post-processing features to evaluate joint distortion, principal directions, joint rotation, and stress, along with developed crack patterns and load paths. In another attempt to use a computer vision-based technique to monitor damage in a structure, Saravanan and Nishio [15] carried out experiments using low-cost cameras and 2D digital image correlation (2D-DIC). Quasi-static loading was applied to the specimens (an aluminum plate girder bridge model) in an indoor bending test, and the in-plane strain field variations near the ends of the girders were estimated under different boundary conditions. Experimental strains obtained in the tests were then compared with the strains resulting from 2D-DIC; the authors also suggested comparing the results with finite element (FE) analysis as an extension of the study. In an attempt to measure structural deformation using one of the most accessible approaches, Kromanis et al. [16] examined structural performance by monitoring deformation using videos and images captured with smartphones while the structure (i.e., beams) was subjected to static, dynamic, and quasi-static loading in a laboratory environment. Conclusions were drawn based on the RMSE of the deviation between the results of the image analysis methods and the collected measurements. In Xiang et al. [17], an approach focused on information shared between cameras is used to increase the robustness of a previously developed algorithm. The authors emphasized the possibility of a camera position-independent imaging approach for the condition assessment of structures, relying on the fact that structural features remain the same between cameras even when images are captured from different camera positions and angles. The DeforMonit application was developed at Nottingham Trent University by Kromanis and Al-Habaibeh [18] to evaluate target displacements.

The main objective of this study is to present a framework to estimate the pose of a 3D object in the real world by deploying a multi-camera approach. Robustness is provided by the bundle adjustment method, which is used for scene reconstruction from multi-view cameras. The first section of this paper has provided a broad overview of the current state of the art in using machine vision and fiducial markers for structural health monitoring. The second section discusses the methodology used to implement the proposed techniques. The proposed methodology is then verified in the experimental section, in which the experimental layout is described, followed by the results and discussion, where the analysis is presented and conclusions are drawn; potential future work is also discussed in the future work section.

13.2 Methodology

13.2.1 Pinhole Camera Model

In computer vision-based studies, understanding camera models and their characteristics is of great importance. In order to understand the projection of the 3D world coordinate system (WCS) into the 2D sensor coordinate system (SCS), the pinhole camera model is discussed in this section.

As shown in Fig. 13.1a, a light beam emitted by a point (\(\mathscr {P}\)) in the 3D WCS can be projected into the 2D space of the image coordinate system (ICS) and sensor coordinate system (SCS). In the first stage, translation and rotation are performed on the coordinates of any point \(\mathscr {P}\) in WCS (\({}^{w}{\boldsymbol{X}}_{P}\)) to transform it into the camera coordinate system (CCS) (\({}^{c}{\boldsymbol{X}}_{P}\)), as shown in Eq. 13.1. In the subsequent stages, the point is projected onto the 2D image coordinate system (ICS) (denoted as \({}^{i}{\boldsymbol{x}}_{p}\) or \(({}^{i}x_{p},{}^{i}y_{p})\)) and the sensor coordinate system (SCS) (denoted as \({}^{s}{\boldsymbol{u}}_{p}\) or \(({}^{s}u_{p},{}^{s}v_{p})\)).

Fig. 13.1 Schematics of (a) projection of a point (P) in the 3D world coordinate system (WCS) to p in the 2D sensor coordinate system (SCS), (b) three-camera setup focusing on four points

The transformation from WCS to CCS is expressed as:

$$\displaystyle \begin{aligned} {}^{c}{\boldsymbol{X}}_{P} = {}^{c}_{w}\mathbf{R}\,\left({}^{w}{\boldsymbol{X}}_{P} - {}^{w}{\boldsymbol{X}}_{C}\right) {} \end{aligned} $$
(13.1)

where \({}^{w}{\boldsymbol{X}}_{C}\) is the location of the camera origin (pinhole camera) in WCS and \({}^{c}_{w}\mathbf{R}\) is the rotation from WCS to CCS.

The equation above is the same as:

$$\displaystyle \begin{aligned} {}^{c}\tilde{\boldsymbol{X}}_{P} = {}^{c}_{w}\mathbf{H}\; {}^{w}\tilde{\boldsymbol{X}}_{P} {} \end{aligned} $$
(13.2)

where \({}^{c}_{w}\mathbf{R} \in \mathbb{R}^{3\times 3}\); the homogeneous coordinate of point \(\mathscr{P}\) (e.g., \({}^{w}\tilde{\boldsymbol{X}}_{P}\)) is obtained from its Euclidean coordinate (\({}^{w}{\boldsymbol{X}}_{P}\)) by appending a unit element: \({}^{w}\tilde{\boldsymbol{X}}_{P} = [{}^{w}{\boldsymbol{X}}_{P}^{T}, 1]^{T}\).

The Euclidean (rigid) matrix, \({ }^{c}_{w}\mathbf {H}\), is defined as:

$$\displaystyle \begin{aligned}^{c}_{w}\mathbf{H} = \begin{bmatrix} {}^{c}_{w}\mathbf{R} &-{}^{c}_{w}\mathbf{R}\, {}^{w}{\boldsymbol{X}}_{C} \\ {\mathbf{0}}^T &1 \end{bmatrix}\end{aligned}$$

The transformation from CCS to ICS can be expressed as:

$$\displaystyle \begin{aligned} \lambda\; {}^{i}\tilde{\boldsymbol{x}}_{p} = {\mathbf{M}}_{\text{proj}}\; {}^{c}\tilde{\boldsymbol{X}}_{P} {} \end{aligned} $$
(13.3)

where \(\lambda\) is the scale factor (the depth of the point in CCS) used so as to convert the homogeneous coordinates (i.e., \({}^{i}\tilde{\boldsymbol{x}}_{p}\)) to Euclidean coordinates (i.e., \({}^{i}{\boldsymbol{x}}_{p}\)).

The projection matrix, M proj, is defined as:

$$\displaystyle \begin{aligned}{\mathbf{M}}_{\text{proj}}= \left[ \begin{array}{cccc} f_x & 0 & 0 & 0 \\ 0 & f_y & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{array} \right]\end{aligned}$$

where \(f_x\) and \(f_y\) denote the focal lengths of the camera along the two image axes.

The transformation from ICS to SCS is expressed as:

$$\displaystyle \begin{aligned} {}^{s}\tilde{\boldsymbol{u}}_{p} = {\mathbf{M}}_{\text{affine}}\; {}^{i}\tilde{\boldsymbol{x}}_{p} {} \end{aligned} $$
(13.4)

The affine matrix, \({\mathbf{M}}_{\text{affine}}\), maps ICS coordinates to pixel coordinates in SCS and, in its standard form, contains the pixel scale factors \(s_x, s_y\) and the principal point offsets \(o_x, o_y\):

$$\displaystyle \begin{aligned}{\mathbf{M}}_{\text{affine}}= \left[ \begin{array}{ccc} s_x & 0 & o_x \\ 0 & s_y & o_y \\ 0 & 0 & 1 \\ \end{array} \right]\end{aligned}$$

Hence, the overall projection from WCS to SCS can be expressed as:

$$\displaystyle \begin{aligned} \lambda\; {}^{s}\tilde{\boldsymbol{u}}_{p} = {\mathbf{M}}_{\text{affine}}\, {\mathbf{M}}_{\text{proj}}\; {}^{c}_{w}\mathbf{H}\; {}^{w}\tilde{\boldsymbol{X}}_{P} \end{aligned}$$

where \({\mathbf{M}}_{\text{affine}}\,{\mathbf{M}}_{\text{proj}}\) collects the intrinsic camera parameters and \({}^{c}_{w}\mathbf{H}\) the extrinsic parameters, which means that a point in WCS is mapped to a pixel in SCS by a single \(3\times 4\) projection, and the Euclidean pixel coordinates \(({}^{s}u_{p}, {}^{s}v_{p})\) are recovered by dividing by the scale factor \(\lambda\).
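To make the projection chain concrete, a minimal Python sketch is given below (assuming NumPy; the intrinsic and extrinsic values are illustrative placeholders rather than the calibration results of this study). It chains Eqs. 13.1, 13.3, and 13.4 to map a WCS point to a pixel in SCS.

```python
import numpy as np

def project_point(X_w, R_cw, X_C_w, fx, fy, sx, sy, ox, oy):
    """Project a 3D point in WCS to a pixel in SCS via CCS and ICS."""
    # WCS -> CCS (Eq. 13.1): rotate and translate into the camera frame
    X_c = R_cw @ (X_w - X_C_w)

    # CCS -> ICS (Eq. 13.3): perspective division by the depth (scale factor)
    x_i = fx * X_c[0] / X_c[2]
    y_i = fy * X_c[1] / X_c[2]

    # ICS -> SCS (Eq. 13.4): affine mapping to pixel coordinates
    u = sx * x_i + ox
    v = sy * y_i + oy
    return np.array([u, v])

# Illustrative values only (not the calibration results of this study)
R_cw = np.eye(3)                      # camera aligned with the WCS axes
X_C_w = np.array([0.0, 0.0, -1.0])    # camera 1 m behind the WCS origin
pixel = project_point(np.array([0.1, 0.05, 0.0]), R_cw, X_C_w,
                      fx=1200.0, fy=1200.0, sx=1.0, sy=1.0, ox=960.0, oy=540.0)
print(pixel)
```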

13.2.2 Bundle Adjustment Algorithm

The reconstruction of 3D points from 2D images has been an active research direction within the photogrammetry realm. There are two main frameworks, as described in [19]. One is referred to as “Structure from Motion” (SfM), a unifying framework that estimates the motion and the structure simultaneously, while the other is referred to as the decoupling framework, also known as motion estimation, where the estimation of structure and motion are performed separately. The decoupling framework typically follows a prototypical approach based on epipolar constraints, in which the optimization is carried out over the space of essential matrices and the deviation is caused by measurement errors (especially image noise) [20]. This approach has a prominent advantage over the unifying framework since it works in a low-dimensional search space. However, its drawbacks include a statistically biased estimation of translation [21, 22] and a sensitivity to noise in the input images that degrades the performance of the decoupling framework [23]. For the unifying framework, on the contrary, one of the most popular approaches is Bundle Adjustment, an optimization approach that minimizes the error between the predicted and observed pixel values.

Bundle Adjustment minimizes the sum of squared errors between the observed 2D pixel values and the projected 2D pixel values, where the projected pixels are obtained as a projection function of the estimated 3D points using the camera parameters. These projected 2D pixel values are obtained by using the camera model described in detail in Sect. 13.2.1. In other words, the Bundle Adjustment technique minimizes the error between detected and projected 2D image pixels using the parameters from multiple camera perspectives. Using Fig. 13.1b, the Bundle Adjustment method can be posed as a nonlinear least squares problem:

$$\displaystyle \begin{aligned} \min_{\boldsymbol{C},\boldsymbol{X}} \, \sum_{i=1}^{n}\sum_{j=1}^{m} (\boldsymbol{u}_{ij} - \pi(\boldsymbol{C}_j,\boldsymbol{X}_{i}))^{2} {} \end{aligned} $$
(13.5)

where \(\boldsymbol{u}_{ij}\) is the observation in SCS corresponding to the i-th 3D point with coordinates \({}^{w}\boldsymbol{X}_{i}\) observed by the j-th camera \(\boldsymbol{C}_{j}\). Let \(\boldsymbol{r}_{ij}\) be the residual vector:

$$\displaystyle \begin{aligned} \boldsymbol{r}_{ij} = (\boldsymbol{u}_{ij} - \pi(\boldsymbol{C}_{j},\boldsymbol{X}_{i})) {} \end{aligned} $$
(13.6)

where \(\boldsymbol{r}_{ij} \in \mathbb{R}^{1\times 2}\); the points \(\boldsymbol{u}_{ij}\) and \(\hat{\boldsymbol{u}}_{ij} = \pi(\boldsymbol{C}_{j},\boldsymbol{X}_{i})\) are the observed and projected 2D pixel values of the same point seen from a camera.

Therefore, using the residual defined in Eq. 13.6, the minimization in Eq. 13.5 can be written as:

$$\displaystyle \begin{aligned} \min_{\boldsymbol{C},\boldsymbol{X}} \, \boldsymbol{r}^{T}\boldsymbol{r} {} \end{aligned} $$
(13.7)

where \(\boldsymbol{r} = [\boldsymbol{r}_{11}, \boldsymbol{r}_{12}, \ldots, \boldsymbol{r}_{1m}, \ldots, \boldsymbol{r}_{n1}, \boldsymbol{r}_{n2}, \ldots, \boldsymbol{r}_{nm}]^{T}\) and \(\boldsymbol{r} \in \mathbb{R}^{2mn\times 1}\).

The objective in Eq. 13.7 can be expanded about the current parameter vector \(\boldsymbol{x}\) by a second-order Taylor expansion (writing the cost as \(\boldsymbol{r}(\boldsymbol{x})\) for brevity):

$$\displaystyle \begin{aligned} \boldsymbol{r}(\boldsymbol{x}+\delta\boldsymbol{x})=\boldsymbol{r}(\boldsymbol{x})+\boldsymbol{g}^{T}\delta \boldsymbol{x}+\frac{1}{2} \delta \boldsymbol{x}^{T} \mathbf{H} \delta \boldsymbol{x} {} \end{aligned} $$
(13.8)

By taking the derivative of the Taylor expansion in Eq. 13.8 with respect to \(\delta\boldsymbol{x}\) and equating it to zero, we obtain:

$$\displaystyle \begin{aligned} \mathbf{H}\delta \boldsymbol{x}=-\boldsymbol{g} {} \end{aligned} $$
(13.9)

where H is the Hessian and g is the gradient of the objective. Under the Gauss–Newton approximation, H can be replaced by \(\boldsymbol{J}^{T}\boldsymbol{J}\) and g by \(\boldsymbol{J}^{T}\boldsymbol{r}\), where \(\boldsymbol{J}\) is the Jacobian of the residual vector \(\boldsymbol{r}\). The simplified equation can be expressed as:

$$\displaystyle \begin{aligned} \boldsymbol{J}^{T}\boldsymbol{J}\delta \boldsymbol{x}=-\boldsymbol{J}^{T}\boldsymbol{r} {} \end{aligned} $$
(13.10)
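In practice, the nonlinear least squares problem in Eqs. 13.5–13.10 can be assembled by stacking the residuals \(\boldsymbol{r}_{ij}\) and handing them to an off-the-shelf solver. The following Python sketch (a simplified illustration assuming NumPy/SciPy and an angle-axis camera parameterization; it is not the exact implementation used in this study) shows the structure of such a residual function:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, n_cams, n_pts, cam_K, observations):
    """Stacked reprojection residuals r_ij = u_ij - pi(C_j, X_i) (Eq. 13.6).

    params packs, per camera, an angle-axis rotation (3) and translation (3),
    followed by the 3D points (3 each). observations is a list of
    (point_index, camera_index, observed_pixel) tuples; cam_K holds the
    3x3 intrinsic matrices.
    """
    cams = params[:6 * n_cams].reshape(n_cams, 6)
    pts = params[6 * n_cams:].reshape(n_pts, 3)
    res = []
    for i, j, uv in observations:
        R = Rotation.from_rotvec(cams[j, :3]).as_matrix()
        t = cams[j, 3:]
        Xc = R @ pts[i] + t                      # WCS -> CCS
        proj = cam_K[j] @ (Xc / Xc[2])           # perspective projection
        res.extend(uv - proj[:2])                # 2 residuals per observation
    return np.array(res)

# least_squares minimizes 0.5 * ||residuals||^2 over the cameras and points:
# result = least_squares(residuals, x0, args=(n_cams, n_pts, cam_K, observations))
```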

13.2.3 Levenberg-Marquardt Algorithm

Because the Bundle Adjustment algorithm usually deals with large error minimization problems, an iterative optimization tool is needed to reduce the global loss until it converges. The Levenberg–Marquardt (LM) algorithm is one of the most popular such tools. The main reason for selecting LM here as the solution to this nonlinear problem is its ability to act both as a gradient descent method and as the Gauss–Newton method: the algorithm adaptively varies the parameter updates between the two. Initially, when the parameters are far from the optimal values, it behaves like a gradient descent method; as the solution converges close to the optimum, it behaves like the Gauss–Newton method.

To understand the LM algorithm, let us consider a function, f, that maps the parameter vector \(\boldsymbol{p}\) to the predicted pixel vector \(\hat{\boldsymbol{x}} = f(\boldsymbol{p})\). The parameter vector is a multi-dimensional vector consisting of the camera rotation \(\mathbf{R}\), translation \(\mathbf{t}\), and the 3D locations (\({}^{w}\mathbf{X}\)) in WCS. Therefore, the parameter vector is represented as \(\boldsymbol{p}(\mathbf{R}, \mathbf{t}, {}^{w}\mathbf{X})\):

$$\displaystyle \begin{aligned} \hat{\mathbf{x}} = f(\mathbf{R},\mathbf{t},{}^{w}{}{\mathbf{X}}) {} \end{aligned} $$
(13.11)

The purpose of the LM algorithm is to find the optimal parameter vector \(\boldsymbol{p}^{+}\) such that the re-projection error \(\epsilon = \mathbf{x} - \hat{\mathbf{x}}\) is minimized.

To solve Eq. 13.10, we use the LM algorithm, and the damped expression can be derived as:

$$\displaystyle \begin{aligned} (\boldsymbol{J}^{T}\boldsymbol{J}+\lambda{I}) \boldsymbol{\delta{x}}=-\boldsymbol{J}^{T}\boldsymbol{r} {} \end{aligned} $$
(13.12)
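The adaptive behavior described above can be sketched as a short iterative loop. The following Python snippet is an illustrative NumPy sketch in which residual_fn and jacobian_fn are assumed, user-supplied callables (not the production solver used here); it shows how the damping parameter λ switches the update between gradient-descent-like and Gauss–Newton-like steps:

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, x0, lam=1e-3, n_iter=50):
    """Minimal LM loop solving (J^T J + lambda*I) dx = -J^T r (Eq. 13.12)."""
    x = x0.copy()
    for _ in range(n_iter):
        r = residual_fn(x)
        J = jacobian_fn(x)
        A = J.T @ J + lam * np.eye(x.size)
        dx = np.linalg.solve(A, -J.T @ r)
        r_new = residual_fn(x + dx)
        if r_new @ r_new < r @ r:
            # Step accepted: reduce damping, behave more like Gauss-Newton
            x, lam = x + dx, lam * 0.5
        else:
            # Step rejected: increase damping, behave more like gradient descent
            lam *= 2.0
    return x
```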

Equations 13.10 and 13.12 can be solved directly as a linear algebraic system of equations, but for larger systems, e.g., one thousand cameras observing two million 3D points, a direct solution becomes impractical. Therefore, for the sake of convenience, the optimization problem is solved by using Eq. 13.10. The projected pixel value is \(\hat{u}_{ij} = \pi(\boldsymbol{C_{j}},\boldsymbol{X_{i}})\), and the parameter block \((\boldsymbol{C}_{j}, \boldsymbol{X}_{i})\) is further split into two sub-blocks, i.e., the camera block “c” and the structure or 3D point block “p”, so that x = [c, p].

For the sake of simplicity, we consider an example of three cameras (m = 3) observing a set of four 3D points (n = 4). To construct the Jacobian matrix for this case, we define \({A_{ij}} = \frac{\partial{\hat{u}_{ij}}}{\partial{c_{j}}}\) and \({B_{ij}} = \frac{\partial{\hat{u}_{ij}}}{\partial{p_{i}}}\); the sparse Jacobian matrix is then constructed as shown in Eq. 13.13 below.

$$\displaystyle \begin{aligned} \boldsymbol{J}= \frac{\partial{\hat{u}}}{\partial {x}} = \begin{bmatrix} {A}_{11} &0 &0 &{B}_{11} &0 &0 &0 \\ 0 &{A}_{12} &0 &{B}_{12} &0 &0 &0\\ \vdots &\vdots &\vdots &\vdots &\vdots &\vdots &\vdots \\ 0 &{A}_{42} &0 &0 &0 &0 &{B}_{42} \\ 0 &0 &{A}_{43} &0 &0 &0 &{B}_{43} \end{bmatrix} {} \end{aligned} $$
(13.13)

However, the camera calibration step is performed beforehand in the present research. Therefore, the camera parameters are taken as constant for all three cameras used in this paper, and the corresponding camera blocks \(A_{ij}\) drop out of the optimization. The modified Jacobian matrix then consists only of the point blocks \(B_{ij}\), since the 3D point parameters (in WCS) are the only unknowns.
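Since only the point blocks \(B_{ij}\) remain in the Jacobian, its sparsity pattern can be passed to a sparse-aware least squares solver. The sketch below (assuming SciPy's least_squares with its jac_sparsity option; the observation list is illustrative) builds that pattern for the three-camera, four-point example:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.optimize import least_squares

def point_block_sparsity(point_idx, n_pts):
    """Sparsity pattern of d(residuals)/d(points): the two residual rows of each
    observation depend only on the 3 coordinates of the observed 3D point."""
    n_obs = len(point_idx)
    S = lil_matrix((2 * n_obs, 3 * n_pts), dtype=int)
    for k, i in enumerate(point_idx):
        S[2 * k:2 * k + 2, 3 * i:3 * i + 3] = 1
    return S

# Example: 4 points (n = 4) each seen by 3 cameras (m = 3), i.e., 12 observations
point_idx = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
sparsity = point_block_sparsity(point_idx, n_pts=4)

# The pattern is then handed to the solver (residuals and x0 as defined earlier):
# result = least_squares(residuals, x0, jac_sparsity=sparsity, method="trf",
#                        args=(n_cams, n_pts, cam_K, observations))
```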

13.3 Experimental Setup

This section provides a detailed description of the experiment conducted in an indoor laboratory environment to implement the methodology described in Sect. 13.2.2. To examine the bundle adjustment technique for multi-perspective sensing in a controlled environment, the 3D points of an object are viewed from three different cameras at the same time. Please note that the cameras are calibrated using the well-known checkerboard calibration technique [24] before the experiment is performed. A total of three cameras are used in this experiment, as shown in Fig. 13.2a and b. A Sony Alpha 6000 (camera#01) and a Sony Alpha 6400 (camera#03) are deployed at the extreme left and extreme right of the experimental layout, respectively, whereas a GoPro HERO10 (camera#02) is placed between the two Sony cameras. After the calibration, the errors in internal calibration of cameras #01, #02, and #03 are observed to be 0.054, 0.062, and 0.033 pixels, respectively. The errors in external calibration of cameras #01, #02, and #03 are found to be 4.371, 8.285, and 3.200 pixels, respectively.
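For reference, the checkerboard calibration mentioned above follows the standard OpenCV workflow; the sketch below is one possible way to perform it (the 9 × 6 inner-corner pattern, square size, and image folder are illustrative assumptions, not the exact board used in the experiment):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                 # inner corners per row/column (illustrative)
square = 0.025                   # square size in meters (illustrative)

# 3D corner locations of the board in its own plane (Z = 0)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calibration/*.jpg"):   # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS reprojection error, intrinsic matrix, and distortion coefficients
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```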

Fig. 13.2 Photos of (a) experimental setup for evaluation, (b) object of interest (i.e., the whiteboard as foreground) placed on the workbench (i.e., background), and (c) schematic view of the experimental setup

13.3.1 Tag-Based Detection Method

A whiteboard with fiducial markers attached is selected as the object of interest. A total of 9 fiducial markers (AprilTags of family t36h11, 30 cm in size) are attached to the face of the object (i.e., the foreground). Additionally, 32 AprilTags of family t25h9 (also 30 cm in size) are attached to the background (i.e., the workbench); these background tags are used to identify the poses of all three cameras used in this experiment. The purpose of using separate background and foreground tag families is to obtain an undisturbed pose estimate for each of the cameras during the experiment. The center point of the workbench (i.e., the background) is taken as the origin of the world coordinate system, so that the position of each fiducial marker is known in WCS and can serve as ground truth. In the current experimental setup, since the orientation of the board (i.e., the foreground) is to be used for 3D reconstruction while capturing the images, it is important to keep the cameras and the background static; hence, no motion is allowed in the camera positions. The background tags are attached to the static board, as can be seen in Fig. 13.2b. After all three cameras and the background are fixed, the foreground (i.e., the board) is brought into focus as the region of interest.
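Tag detection itself can be reproduced with an off-the-shelf AprilTag detector. The following sketch assumes the pupil_apriltags Python bindings and illustrative camera intrinsics, and is one possible way to obtain the tag centers, corners, and poses used as observations (not necessarily the exact toolchain of this study):

```python
import cv2
from pupil_apriltags import Detector

# Separate detectors for the foreground (t36h11) and background (t25h9) families
fg_detector = Detector(families="tag36h11")
bg_detector = Detector(families="tag25h9")

# Hypothetical frame from one of the cameras
gray = cv2.cvtColor(cv2.imread("frame_cam01.png"), cv2.COLOR_BGR2GRAY)

# camera_params = (fx, fy, cx, cy) from calibration; tag_size in meters
camera_params = (1200.0, 1200.0, 960.0, 540.0)        # illustrative values
bg_tags = bg_detector.detect(gray, estimate_tag_pose=True,
                             camera_params=camera_params, tag_size=0.30)
fg_tags = fg_detector.detect(gray, estimate_tag_pose=True,
                             camera_params=camera_params, tag_size=0.30)

for tag in bg_tags:
    # tag.center is the detected pixel location; tag.pose_R / tag.pose_t give
    # the tag pose in the camera frame, used to fix the camera pose in WCS
    print(tag.tag_id, tag.center)
```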

13.3.2 Deployments of Cameras

The cameras are placed such that all three are able to capture any movement of the object, as shown in Fig. 13.2a and c. Additionally, two LED lights are used to improve the visual conditions and hence the quality of the data collected during the experiment.

13.3.3 Concrete Beam Bending

As an application of the proposed technique for the identification and quantification of deformation in structural members using a non-contact approach, a monocular camera setup is used to quantify the deformation of a beam during a destructive four-point bending test. In this experiment, a specimen constructed from Ultra-High Performance Concrete with dimensions of 9 × 3 × 3 inch (228.6 × 76.2 × 76.2 mm) is used. A total of 9 AprilTags (from the family t25h9) are evenly deployed on the face of the specimen. A Sony Alpha 6400 camera with a focal length of 50 mm is used to capture the data. The camera is calibrated beforehand using checkerboard calibration. Additional LED lights are deployed around the experimental area to improve the lighting conditions.
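For the monocular beam test, one simple way to convert tag motion into deflection is to track each tag center between a reference frame and the current frame and scale the pixel displacement by the tag's known physical edge length. The sketch below is illustrative; it assumes detections such as those returned by the detector shown earlier and an approximately fronto-parallel camera view:

```python
import numpy as np

def vertical_deflection_mm(ref_detection, cur_detection, tag_size_mm):
    """Estimate vertical deflection of one tag between a reference frame and
    the current frame, using the tag's own edge length as the pixel-to-mm scale."""
    # Mean edge length of the tag in pixels in the reference frame
    corners = np.asarray(ref_detection.corners)
    edge_px = np.mean(np.linalg.norm(np.roll(corners, -1, axis=0) - corners, axis=1))
    mm_per_px = tag_size_mm / edge_px

    # Positive value = downward motion in image coordinates (v grows downward)
    dv_px = cur_detection.center[1] - ref_detection.center[1]
    return dv_px * mm_per_px
```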

Ultra-High Performance Concrete (UHPC) is a special type of concrete designed for extreme loading conditions: unlike normal concrete, which has a compressive strength of around 4 to 6 ksi, UHPC is designed to achieve a strength of 18 to 22 ksi. The UHPC used in this study is designed using steel fibers (7 mm in length) that provide the necessary reinforcement, along with cement and some admixtures. UHPC also has the tendency to self-compact, whereas normal concrete requires additional means of compaction; it is therefore referred to as a self-compacting concrete. However, these properties come at a high cost: compared to normal concrete, UHPC costs about 10 times more. It also requires extra attention during the pouring and curing stages. For example, it has to be poured as soon as possible because it hardens more quickly than ordinary concrete, and during curing it requires fresh water to be sprayed during its initial 24 hours of hardening.

13.4 Results and Discussion

13.4.1 Tag-Based Detection

The AprilTag detection results for the evaluation experiment are shown in Fig. 13.3. Figure 13.3(a)–(c) show the detected pixel locations (red dots) and projected pixel locations (blue circles) of the AprilTags attached to the background/workbench for cameras #01–03, respectively. A sample of background (green square) and foreground (pink square) tag detections from camera#01 is shown in Fig. 13.3(d). The AprilTag detection results on the concrete beam [shown in Fig. 13.4(a)] before and after loading are shown in Fig. 13.4(b) and (c), respectively. Figure 13.5a–c show the randomized 3D locations selected as the initial step of the least squares optimization (LM), in which the projected pixels are computed as a function of the camera parameters and 3D world locations. Figure 13.5d–f show the optimized 3D locations obtained at the end of the least squares estimation. The initial error is reduced from about 400 pixels to about 26 pixels at the optimal solution. After the least squares optimization, the reconstructed world coordinates of the 45 points are found to be in good agreement with the provided ground truth. Hence, the proposed approach achieves an accurate estimation of structural deformation using multi-view reconstruction.

Fig. 13.3 Images of (a) detected pixel locations (red dots) and projected pixel locations (blue circles) on the background, (b) detected (red dots) and projected (blue circles) pixels on the background after calibrating camera#02, (c) detected (red dots) and projected (blue circles) pixels on the background after calibrating camera#03, and (d) sample background (green square) and foreground (pink square) tag detections from camera#01

Fig. 13.4 (a) Schematic view of the beam with fiducial markers attached to the surface, and images of the beam with tag detection (b) before loading and (c) after loading

Fig. 13.5 Plots of (a), (b), (c) the initialized random XYZ (3D) locations used to start the optimization algorithm, and (d), (e), (f) the optimized 3D locations corresponding to each point

13.4.2 Concrete Beam Bending

The vertical deflection results captured by the proposed non-contact tag detection approach are shown in Fig. 13.6. Results for 5 out of the 9 tags attached to specific locations on the beam's surface are presented. Tags 0, 1, 2, and 3 are attached to the top-left, top-right, bottom-left, and bottom-right corners of the beam, respectively, while tag#5 is attached to the top center of the beam. These values are compared with the vertical deflection values provided directly by the UTM (universal testing machine), as shown in Fig. 13.6a.

Fig. 13.6 (a) Load vs. displacement from the UTM, and vertical deflection of (b) Tag-A (top left), (c) Tag-B (top right), (d) Tag-C (bottom left), (e) Tag-D (bottom right), and (f) Tag-F (top center)

The results in Fig. 13.6b–e show good agreement with those from the UTM (i.e., close to 6 mm). From Fig. 13.6f it can be seen that, since tag#5 (Tag-F) is attached to the mid-point of the beam at the top, after a while in the experiment this tag moves downward together with the beam as the load is exerted by the upper loading head of the UTM. Therefore, we see an increase in vertical deformation at tag#5 that later flattens into a curve and then reduces once the mid-section of the beam starts moving downward.

13.5 Conclusion and Future Work

This study aims to provide a full-field, non-contact measurement approach for a four-point bending test on UHPC beams using tag-based robotic vision. An evaluation experiment is carried out on a board with AprilTags attached, using images taken from multiple cameras with overlapping perspectives. Shared feature points are detected using tag detection algorithms. The 3D reconstruction is treated as a Bundle Adjustment problem, and the optimization is conducted by the Levenberg–Marquardt algorithm. Preliminary results show a good match between the location estimates and the ground truth. As an extension of the current work, the authors will reconstruct the 3D geometry and deformation by using a multi-perspective imaging system and a more robust algorithm.