1 Introduction

A large number of structures and significant parts of existing infrastructures built in the last century are now close to, or even beyond, the end of their design lifespan. Due to inherent defects that relate to fabrication and construction processes, as well as owing to the cyclic nature of operational (e.g., traffic) loads, a common pathology for steel infrastructures concerns the formation of fatigue cracks. Such cracks accelerate structural deterioration and, in the absence of proper inspection and maintenance, may eventually lead to failure [1].

Structural health monitoring (SHM) techniques have been proposed as a complementary practice to on-site visual inspections, which typically require substantial human resources and suffer from bias, as the inspectors’ assessment is subjective [2]. SHM techniques offer a means for reducing such bias, via complementary exploitation of evidence stemming from data. This evidence is built up via extraction of engineered features that are sensitive to damage and subsequent establishment of rigorous metrics which serve for damage detection and, possibly, quantification. To this end, vibration-based techniques, mostly based on ambient recordings, have gained popularity, especially for bridges and slender structures [3, 4]. However, the parameters obtained using such methods generally refer to the global behavior of structures and are thus less sensitive to local damage. In addition, information on localization may only be delivered when dense sensor networks are used [5]. The deployment of such dense networks on large-scale civil structures, however, leads to issues related to data management and instrumentation costs. In an attempt to relax constraints relating to cost and cabling implications, wireless sensing technologies have been adopted, which in turn require strict data synchronization [6,7,8].

When wishing to detect local defects, particularly manifestation of cracking, vision-based procedures offer a valuable enabler [9,10,11,12] due to their effectiveness in identifying structural anomalies, via use of cameras directly deployed on the structure, mounted on drones, or used by expert operators [13, 14]. Vision-based methods initially relied exclusively on image processing techniques to detect superficial defects, such as corrosion and cracks. In this context, Abdel-Qader et al. [15] offer a comparative study considering the Haar wavelet transform, the Fourier transform, and two edge detectors (namely, the Sobel and the Canny detectors) for crack identification in bridges. Edge detection methods have dominated early literature in this domain [16, 17]. However, this technique may lead to ill-posed problems, since lighting and image noise may considerably affect the analysis result, eventually yielding inaccurate crack detection [11]. Although denoising techniques may increase accuracy, their efficiency is strongly case dependent [11, 18]. A review on image processing techniques used for crack detection can be found in reference [19].

With the development of high-performance graphics processing units (GPUs) and parallel computing [20], an ever-increasing number of applications are employing machine learning and, in particular, neural networks for vision-based damage detection [21]. Convolutional neural networks (CNNs), inspired by the functioning of the visual cortex of animals, are commonly used in the vision-based field. In general, machine learning tools in vision-based applications can identify cracks at the image, block, or pixel level, according to the capability of a given algorithm to detect either the presence of a crack in an inspection image, determine the approximate cracked region, or identify the precise pixels where the crack is, respectively [22]. As an example of successful applications of machine learning in vision-based crack detection, Li and Zhao [23] modified the AlexNet convolutional neural network to classify image regions as either “cracked” or “non-cracked”; thus, achieving a block-level validation accuracy of 99.06%. Similarly, Zhang et al. [24] employed a CNN able to classify small image regions with a precision of 86.96% for road crack detection applications. Cha et al. [11] presented a study on concrete cracks, obtaining 98% accuracy on the binary classification of 256 \(\times\) 256-pixel regions as either “damaged” or “undamaged” areas.

Recently, the pixel-level identification class has gained increasing interest, since the accurate identification of crack pixels can quantify the crack size and, thus, the entity of damage. Dung and Anh [25] employed an encoder–decoder network (called “U-net”) trained end-to-end using a dataset of pixel-level annotated images to identify pixels related to reinforced concrete cracks. A similar network was employed by Huyan et al. [22] to identify cracks on road pavements with 99.01% accuracy. In most of the mentioned studies, the material background was relatively uniform and the noise sources were limited to changing lighting conditions. Xu et al. [26] used a deep fusion CNN exploiting the combination of multilevel features to achieve an overall accuracy of more than 95% in the validation phase. These results were obtained in an environment with complex disturbances given by crack-like handwriting scripts, different paint, and exposure conditions; a setting which is typically met on actual bridge systems. Other techniques [14, 27] employ machine learning-based techniques to achieve super resolution and improve crack detectability in blurred image conditions.

In further overviewing relevant machine learning-based schemes, it is worth noting the work by Dung and Anh [25], who presented a method based on a convolutional autoencoder to directly detect cracks in photos of concrete elements. The authors also underlined the difficulty in estimating the crack size due to different noise sources present in inspection images. Over the last few years, further researchers have attempted to characterize the identified cracks in terms of length, width, and area. For instance, Jin et al. [28] calculated the morphological skeleton of identified crack areas and employed a flexible kernel to estimate their size.

Although much effort has been made to improve the identification accuracy, supervised machine learning approaches that provide pixel-level results require a time- and labor-intensive labeling process to build a training dataset: collect photos of damaged elements, perform manual and accurate offline labelling of the pixels corresponding to cracks, and train the CNN for further in situ identification. Also, the optimal training dataset is generally application-specific, i.e., different training sets are necessary for each case study since materials, lighting conditions, and defect types can be different. Moreover, the input of CNN-based techniques typically consists of the entire inspection image. Qiao et al. [29] used a densely connected CNN with added upsampling layers to generate output images with the same size as the input images, which, however, involve considerable computational cost, as do the complex U-net and AlexNet architectures, largely used for vision-based applications [23, 25].

This study proposes a light crack detection tool that can be easily adapted to diverse case studies due to the low effort required to build the training set. The proposed method is devised to offer an alert of damage, together with useful quantitative information, based on photos collected manually by operators or in an automated fashion by unmanned aerial vehicles (UAVs). A central element of this study lies in minimization of false alarms triggered by different disturbance sources, such as varying light conditions and crack-similar elements (e.g., handwritten scripts). The proposed method aims at achieving a pixel-level identification by means of a hybrid algorithm based on a CNN operating at the block level and edge detection procedure, which allows identifying pixel-level cracks within the identified blocks. Compared against state-of-the-art approaches, this procedure reduces the labor related to the preparation of the training set, since the training images need to be labelled only at block level. Also, the input of the CNN is a set of small images (specific regions of the original inspection photos) that require a simpler architecture when compared against schemes in existing literature [11, 13, 22, 23, 25, 30]. Moreover, the edge detection process is applied to (small) selected regions; thus, comprising lower computational complexity than a traditional application of the same process to the entire inspection image.

Quantitative characteristics of the identified cracks (namely, total length and average width) can be identified using the proposed algorithm to provide operators with preliminary indicators of the damage entity. These quantities are expressed in pixels and can be converted to physical measurements using well-known methods, depending on the hardware employed. For instance, if the inspection camera is monocular, photogrammetry techniques can be employed to retrieve measurements from a sequence of images taken from different angles [31, 32]. Also, recent studies have demonstrated that the use of binocular cameras (generally, a regular high-resolution camera coupled with a depth sensor camera) can be effectively employed to obtain distance information from a single photo [33, 34].

The photos of fatigue cracks collected on the steel box girders of an in-service long-span bridge in China are used to validate the proposed approach. The data employed in this paper were provided in the framework of the 1st International Project Competition for Structural Health Monitoring (IPC-SHM, 2020) [35].

2 Proposed method

We propose a two-step approach (Fig. 1) to localize pixels corresponding to cracks within an image of a structural element/detail. These pixels will henceforth be referred to as “crack pixels”, while we refer to the photos of structural elements as “inspection images”. The first step involves identification of rectangular regions (of a fixed, user-defined size) within the inspection image that contain the crack. Specifically, the inspection images are divided into a grid of smaller, non-overlapping regions (in this study, 32 \(\times\) 32 pixels), which are then classified as “damaged” or “undamaged” regions using a CNN that admits single regions of the inspection images as input. Image processing techniques are subsequently applied, in the second step of the procedure, only to previously identified “damaged” areas to localize the crack pixels accurately. In this way, the computational burden of image processing is minimized, and pixel-level crack identification can be easily obtained, although the training process is carried out at a block level. Depending on the nature of application and objective, Step 2 may be omitted. Moreover, between the two main steps, a cleaning procedure based on morphological operations can be applied to the outcome of the CNN to reduce false positives (FPs) and false negatives (FNs), maximizing true positives (TPs) and true negatives (TNs). Herein, “positives” denote the cracked regions and “negatives” the pristine material, while “true” and “false” denote correctly and incorrectly identified samples, respectively.

Fig. 1 Outline of the proposed procedure for crack detection

In the next few sections, the methodology briefly described above and schematized in Fig. 1 is outlined in detail. Section 2.1 explains the pre-processing operations necessary to train the CNN. Sections 2.2 and 2.3 describe the two steps of the identification procedure that ultimately yields an estimate of the crack length and width.

2.1 Pre-processing of inspection images and training of the CNN

In general, for CNN applications, the input image size strongly affects the computational runtime both for network training and, more importantly, for the classification of new datasets. However, in most machine-learning-based procedures for crack identification, images representing large areas are directly used as input [29, 30]. In other applications, inspection images are processed using moving windows; thus, considering regions of reduced size. In particular, Cha et al. [11] used 256 \(\times\) 256 pixel windows to identify cracked areas in reinforced concrete structures. However, when different noise sources disturb the images, the efficacy of crack identification is correspondingly hindered. The consideration of targeted localized image frames may facilitate distinguishing between actual cracked areas and crack-like noise sources. In particular, when the structural element images include handwritten scripts or when the material texture is wrinkled (e.g., welding joints), considering image regions that are a few times (i.e., 5–10 times) larger than the expected crack width can facilitate the identification of structural flaws. For this reason, in this study, the inspection images are preliminarily divided into 32 \(\times\) 32-pixel regions that are provided as input to the proposed crack identification method. The processing of these small regions also involves faster computation, since the CNN has a 32 \(\times\) 32 input layer and thus, requires few convolutional layers for classification. Figure 2 shows the architecture of the CNN employed in this study. Each layer is identified using an identification code (ID), which is described in Table 1.

Fig. 2 Architecture of the neural network for binary classification of regions of the inspection images as “damaged” or “undamaged”

Table 1 Description of the CNN layer architecture

Since Step 1 employs a CNN for region classification, it requires a preliminary training phase using labeled images, i.e., photos where the crack location is known. However, it should be noted that, contrary to existing methods, a rough knowledge of cracked areas is sufficient, considering that the labels are assigned to the 32 \(\times\) 32 regions; this implies alleviation of the need to precisely highlight the exact crack pixels. Therefore, the training dataset consists of a set of regions labeled as “damaged” or “undamaged”. This characteristic considerably simplifies the labeling process, which can be done quickly by operators using, for example, a capacitive pen on a tablet.

The inspection images are first pre-processed and divided into regions to generate a dataset suitable for training the CNN. Specifically, based on the assumption that color information in inspection images does not contribute to identifying cracks, since different lighting conditions may distort it, the original inspection images are first transformed into grayscale figures. This process further simplifies the region classification, since the greyscale input data consists of a single-color layer instead of the three-color layers of the red–green–blue (RGB) coding; thus, reducing the required filters in the convolutional layers. Each grayscale image of the training set is then divided into 32 \(\times\) 32-pixel regions forming the training samples for Step 1. It should be noted that, to increase the size of the training dataset, overlapping may be considered. However, in this study, the inspection images are divided using a regular grid without overlapping. A CNN with the structure shown in Fig. 2 is, thus, trained using the generated dataset.
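The pre-processing described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (the study used MATLAB). The function names are hypothetical, and the grayscale weights are an assumption (the common BT.601 luma weights), since the source does not state which conversion was used.

```python
def to_grayscale(rgb_image):
    """Convert an RGB image (rows of (r, g, b) tuples) to grayscale intensities."""
    # Assumption: BT.601 luma weights; the source does not specify the conversion.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def split_into_regions(gray_image, size=32):
    """Split a grayscale image into non-overlapping size x size regions.

    Returns a dict mapping the grid index (i, j) to the region (a list of rows).
    Incomplete regions at the right and bottom borders are discarded.
    """
    n_rows = len(gray_image) // size
    n_cols = len(gray_image[0]) // size
    regions = {}
    for i in range(n_rows):
        for j in range(n_cols):
            regions[(i, j)] = [row[j * size:(j + 1) * size]
                               for row in gray_image[i * size:(i + 1) * size]]
    return regions
```

Keeping the grid index (i, j) with each region makes it straightforward to map the classification results back onto the inspection image in Step 1.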

2.2 Step 1: selection of the damaged regions

The trained neural network should be able to discern between damaged and undamaged 32 \(\times\) 32 regions of the inspection image. To process a new dataset to identify the damage regions in unlabeled inspection images, a pre-processing procedure, consisting of desaturation and splitting (as described in the previous section), must be preliminarily performed. Hence, a set of testing 32 \(\times\) 32 regions compatible with the trained CNN is generated. Each unlabeled input region can then be classified as either “damaged” or “undamaged”.

At this point, a strategy based on morphological operations can be employed to mitigate FP and FN outcomes that may result from diverse sources of noise, as shown in Fig. 3. To this aim, a map of the classified regions in the inspection image is first generated using Boolean variables. Specifically, a matrix with dimensions \(p\times q\) is built, where \(p\) and \(q\) are the numbers of 32 \(\times\) 32 regions that form the inspection image in the vertical and horizontal directions, respectively. Each element of this matrix is set to either 0 or 1 (binary value), corresponding to a region being classified as “undamaged” or “damaged”, respectively. A cleaning procedure is then performed by converting isolated damaged segments (i.e., instances of damage that are surrounded by undamaged segments in all directions, including the diagonals) into undamaged ones. Moreover, a morphological dilation is performed to minimize discontinuities in damaged areas. In particular, for each connected region in the Boolean map (considering 8-connected pixels, i.e., sets of neighboring pixels that touch each other at their edges or corners), an ellipse having the same second moments as the analyzed connected region is evaluated. Then, its orientation (i.e., the angle between the horizontal axis and the major axis of the ellipse) is calculated. The dilation is then performed using line structuring elements with a length of 3 pixels and an angle given by the average of the orientations calculated throughout the map. An exhaustive theoretical explanation of dilation and, more generally, of morphological operations is not reported here for the sake of brevity; interested readers can find more details in [36].
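The cleaning procedure can be sketched as follows, under simplifying assumptions. The function names are hypothetical. The dilation angle is passed in directly rather than computed from the ellipse orientations, and it is coarsely quantized to one of four directions (0, 45, 90, or 135 degrees), which is a simplification of the line structuring element described above.

```python
import math


def remove_isolated(grid):
    """Set to 0 every damaged cell that has no damaged cell among its 8 neighbors."""
    p, q = len(grid), len(grid[0])
    cleaned = [row[:] for row in grid]
    for i in range(p):
        for j in range(q):
            if grid[i][j] != 1:
                continue
            has_neighbor = any(
                grid[i + di][j + dj] == 1
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0) and 0 <= i + di < p and 0 <= j + dj < q
            )
            if not has_neighbor:
                cleaned[i][j] = 0
    return cleaned


def dilate_with_line(grid, angle_deg):
    """Dilate a 0/1 map with a 3-cell line element oriented at angle_deg.

    The angle is measured counter-clockwise from the horizontal axis and is
    quantized to one of four directions (a simplification for this sketch).
    """
    theta = math.radians(angle_deg)
    dj = round(math.cos(theta))    # column step along the line
    di = -round(math.sin(theta))   # row step (row indices grow downwards)
    p, q = len(grid), len(grid[0])
    dilated = [row[:] for row in grid]
    for i in range(p):
        for j in range(q):
            if grid[i][j] == 1:
                for s in (-1, 1):  # the two ends of the 3-cell line
                    a, b = i + s * di, j + s * dj
                    if 0 <= a < p and 0 <= b < q:
                        dilated[a][b] = 1
    return dilated
```

Applying `remove_isolated` before `dilate_with_line` matters: dilating first would attach neighbors to isolated false positives and prevent their removal.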

Fig. 3 Morphological operations performed to reduce incorrect classifications

The final mask that identifies the damaged regions is then obtained by reconverting the corrected Boolean map into a matrix with the same size as the inspection image. The regions selected by this mask can then be processed through Step 2 of the proposed approach to identify the crack pixels. It should be noted that the identified crack regions typically form a small portion of the inspection image. As a result, the number of pixels processed in the next step of the procedure, and hence the computational runtime, is considerably reduced.

2.3 Step 2: crack detection and characterization

In many practical applications, a rough approximation of the crack size is sufficient to prioritize inspections or raise alarms when limited resources are available. Nevertheless, accurate identification of cracked areas, including precise identification and characterization of crack pixels, remains the ultimate goal of crack detection approaches. In this study, only the damaged regions identified in Step 1 are examined to highlight the crack pixels. Step 2 includes a crack detection and a crack characterization sub-step.

Crack detection The first process is applied to localize the crack pixels in the identified damaged regions (Fig. 4a). To this end, the Sobel edge detection method is employed [37]. This method involves calculating the gradient \(G\) of a given identified region \(A\) as

$$G = \sqrt {G_{x}^{2} + G_{y}^{2} } ,$$
(1)
Fig. 4 Crack detection and characterization process: a identified damaged region, b edge detection, c filling and erosion, d topological skeleton, e distance transform, f final distance values

where \(G_{x}\) and \(G_{y}\) are two images describing the row and column gradient approximations of \(A\), respectively, calculated as the following two-dimensional convolutions:

$$G_{x} = \left[ {\begin{array}{*{20}c} 1 & 2 & 1 \\ 0 & 0 & 0 \\ { - 1} & { - 2} & { - 1} \\ \end{array} } \right]*A,$$
(2)
$$G_{y} = \left[ {\begin{array}{*{20}c} 1 & 0 & { - 1} \\ 2 & 0 & { - 2} \\ 1 & 0 & { - 1} \\ \end{array} } \right]*A,$$
(3)

where \(*\) is the convolution operator.

To facilitate further processing, a Boolean map is generated from \(G\) by assigning values of 0 and 1 to the pixels that are below and above a threshold, respectively. The threshold can be defined based on the signal-to-noise ratio (SNR) of the processed image [38] (Fig. 4b). Here, the threshold is automatically set to twice the root mean square (RMS) of \(G\). Two further morphological operations are subsequently performed: filling, which replaces 0 values with 1 values in closed regions, and erosion, which removes small objects and smooths the boundaries of identified areas. Such areas contain the identified crack pixels (Fig. 4c).
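The edge detection and thresholding steps can be illustrated with a plain-Python sketch. It uses the two Sobel kernels given above, the standard gradient magnitude (square root of the sum of the squared components), and a threshold of twice the RMS of \(G\). Two choices are assumptions of this sketch and are not stated in the source: border pixels are left at zero, and the RMS is taken over the whole region including those borders. Filling and erosion are not included.

```python
import math

KERNEL_X = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # kernel of Eq. (2)
KERNEL_Y = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # kernel of Eq. (3)


def convolve3x3(image, kernel):
    """2-D convolution with a 3x3 kernel; border pixels are left at zero."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            total = 0.0
            for i in range(3):
                for j in range(3):
                    # Convolution flips the kernel relative to correlation.
                    total += kernel[i][j] * image[r + 1 - i][c + 1 - j]
            out[r][c] = total
    return out


def sobel_edges(image, factor=2.0):
    """Return (G, Boolean edge map), thresholding G at factor * RMS(G)."""
    gx = convolve3x3(image, KERNEL_X)
    gy = convolve3x3(image, KERNEL_Y)
    g = [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
         for row_x, row_y in zip(gx, gy)]
    values = [v for row in g for v in row]
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    threshold = factor * rms
    edges = [[v > threshold for v in row] for row in g]
    return g, edges
```

Because the threshold scales with the RMS of the region itself, a region with no intensity variation yields an empty edge map rather than amplified noise.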

Crack characterization Upon localizing the crack pixels, the width and length of cracks are estimated. First, the topological skeleton of the Boolean map obtained from \(G\) is calculated using the medial surface axis thinning algorithm [39] (Fig. 4d). Then, the distance transform [40] of the crack pixels is computed, assigning to each pixel a value that represents its distance from the closest point of the identified skeleton, as shown in Fig. 4e. In particular, the Euclidean distance between the centers of the pixels is employed as the distance metric.

The crack length is then calculated as the number of skeleton pixels (i.e., the sum of the pixels in Fig. 4d). On the other hand, a global statistical distribution of the crack width for a given inspection image is calculated as the probability distribution of all the distance values lying on the edge of the identified crack (Fig. 4f). Specifically, the edge values are multiplied by 2 before obtaining the crack width distribution, as it is assumed that the crack is locally symmetric with respect to its skeleton. In this study, the crack width distribution is obtained by counting how many times each width value is identified in a given image and representing these results on a width-occurrence diagram. Moreover, a representative value (e.g., the mode or the median of the width statistical distribution) can be calculated to synthetically characterize the crack width.
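A minimal sketch of the characterization step is given below. It assumes that the skeleton has already been computed by some thinning algorithm and is supplied as a list of pixel coordinates; the thinning itself is not reproduced here. The function name, the brute-force distance search, the 4-neighbor edge definition, and the rounding of widths to one decimal are all choices made for illustration only.

```python
import math
from collections import Counter


def characterize_crack(mask, skeleton):
    """Estimate crack length and width (both in pixels).

    mask: 2-D list of 0/1 values marking the crack pixels.
    skeleton: iterable of (row, col) skeleton pixels, assumed precomputed.
    Returns (length, width histogram, mode of the width values).
    """
    skeleton = list(skeleton)
    if not skeleton:
        return 0, Counter(), 0.0
    rows, cols = len(mask), len(mask[0])

    def on_edge(r, c):
        # Edge pixel: a crack pixel with a background 4-neighbor inside the image.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            a, b = r + dr, c + dc
            if 0 <= a < rows and 0 <= b < cols and mask[a][b] == 0:
                return True
        return False

    widths = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and on_edge(r, c):
                # Euclidean distance to the closest skeleton pixel (brute force).
                d = min(math.hypot(r - sr, c - sc) for sr, sc in skeleton)
                widths.append(round(2 * d, 1))  # doubled: local symmetry assumption
    histogram = Counter(widths)
    mode = histogram.most_common(1)[0][0] if histogram else 0.0
    return len(skeleton), histogram, mode
```

Note that, with distances measured between pixel centers, a band that is three pixels thick yields a width of 2 in this sketch, since each edge pixel lies one pixel away from the skeleton.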

3 Application

Automatic crack detection is a valuable tool to support and complement traditional visual inspection processes. In this section, the applicability of the method proposed in Sect. 2 is evaluated, with crack detection and characterization examples. The two-step procedure is applied to a real set of photos collected in the proximity of welding joints of a cable-stayed bridge in China [26, 35]. Due to the initial defects of the material and the dynamic load of passing vehicles, fatigue cracks have nucleated around the welding joints over the years. The data employed in this paper were provided in the framework of the 1st International Project Competition for Structural Health Monitoring (IPC-SHM, 2020) [35]. Specifically, a set of 100 images randomly selected from the 120 available labeled images, each with a size of 4928 \(\times\) 3264 pixels and collected with different camera parameters and lighting conditions, has been used to train the CNN, together with the corresponding labels. A high-resolution consumer-grade camera (Nikon D7000, with a 23.6 \(\times\) 15.6 mm CMOS sensor) was used to capture the photos from a distance of about 1 to 2 m. The labels are Boolean pixel maps that have ones and zeros in crack and non-crack areas, respectively. The labels were generated by manually selecting the pixels corresponding to crack areas.

To prepare the data for the training process, each image is first converted into a single-layer greyscale image and then divided into 32 \(\times\) 32-pixel regions. In Fig. 5, three examples of both damaged and undamaged regions are shown for illustration. It can be observed that varying lighting conditions result in significant differences between greyscale samples. Marker notes within the inspection images (Fig. 6a) pose an additional challenge and could affect traditional image processing procedures, such as edge detection, leading to the identification of FPs. The first step of the presented method is intended to mitigate this effect by examining sufficiently small regions, where cracks are clearly distinguishable. In this application, the training process of the CNN reported in Fig. 2 was conducted through 20 epochs using the Adam optimization algorithm [41], with a constant learning rate equal to 0.005, denominator offset \(10^{-8}\), decay rate of the gradient moving average 0.9, and batch size 128. Training is performed using the MATLAB (R2020b version) software on an Intel® Core™ i7-8700 CPU (6 cores, 3.20 GHz) with a 2 GB NVIDIA Quadro P620 GPU, 32 GB RAM, and the Windows 10 operating system. The number of undamaged samples in the training process is high (1,553,803) compared to the number of damaged ones (16,997). However, the same weights are used for both classes in the loss function to induce robust learning with respect to the wide palette of noise sources in the undamaged regions.
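To make the role of the reported hyperparameters concrete, a single Adam update for one scalar parameter can be sketched as follows. The learning rate, denominator offset, and gradient decay rate are those stated above. The decay rate of the squared-gradient moving average (`beta2`) is not reported in the text, so the value 0.999 used here is an assumption (it is the value suggested in the original Adam paper). The function name is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count.

    beta2 is an assumed value, as it is not reported in the text.
    """
    m = beta1 * m + (1 - beta1) * grad           # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

At the first step the bias correction makes the update approximately equal to the learning rate times the sign of the gradient, so the denominator offset only matters when gradients are close to zero.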

Fig. 5 Examples of damaged (a) and undamaged (b) regions

Fig. 6 Crack detection process: a original inspection image, b identified crack regions, c cleaned identified crack regions, d identified crack pixels

A step-by-step example of the identification procedure is shown in Fig. 6. Specifically, Fig. 6b shows the damaged areas (highlighted with yellow squares) before applying the cleaning process illustrated in Fig. 3. Some FPs and FNs are present due to the uncertainties of the CNN related to changing lighting conditions, the roughness of the surface, and the marker handwriting. After applying the cleaning process, the FPs are entirely removed, while FNs are also reduced, thereby improving the overall quality of the identified regions (Fig. 6c). The identified regions of this last figure are subsequently processed through Step 2, described in Sect. 2.3, resulting in the red crack pixels shown in Fig. 6d.

Twenty further inspection images, not included in the training set, have been used to test the effectiveness of the proposed procedure.

Step 1 is applied by dividing each inspection image into 15,708 samples of size 32 × 32 pixels. The precision obtained in this step is \(p\) = 98.4% (calculated as TPs/(TPs + FPs)), while the recall is \(r\) = 60.8% (calculated as TPs/(TPs + FNs)), resulting in an \({F}_{1}\) score of 75.2% (calculated as \(2pr/(p+r)\)).
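The three metrics can be written compactly as below. The function names are hypothetical, and the TP, FP, and FN counts themselves are not reported in the text, so only the relation between precision, recall, and the \({F}_{1}\) score can be checked against the quoted values.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts of TPs, FPs, and FNs."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_score(0.984, 0.608)` evaluates to about 0.752, consistent with the reported 75.2%.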

As a literature comparison, the results obtained by Xu et al. [26] are reported. The authors of the mentioned paper implemented a fusion convolutional neural network which showed excellent performance with the same dataset employed in this study. Specifically, the authors classified 64 × 64 images into “crack”, “handwriting”, and “background”. In the mentioned study, considering cracks as “positives” and the other two classes as “negatives”, the precision obtained considering 6720 image samples of size 64 × 64 pixels is 95.5%, while the recall is 93.9%, resulting in an \({F}_{1}\) score of 94.7%. Although the recall obtained in this paper is relatively low compared to the reference study, the simplicity of the CNN and the pixel-level crack identification obtained with a block-level training process are strengths of the presented method. Also, the precision indicator is very high, denoting a correct classification of pristine areas as non-cracked regions, despite the presence of several disturbances.

It should be noted that a higher true-positive rate could be obtained by adjusting the weights of the loss function used in the training procedure, giving more importance to the correct identification of positive samples. However, in this way, the false-positive rate would increase, drawing the attention of expert operators even when positives are caused by disturbance sources. It should also be noted that a relatively low recall (i.e., a relatively high number of false negatives) does not mean that the cracks are not identified. Indeed, generally, missing positives in Step 1 would only affect the estimated crack length, as shown hereafter.

For the considered inspection images, the crack length and width have been identified using the characterization process presented in Sect. 2.3. The results obtained for each image are reported in Fig. 7. In particular, the crack length shown in Fig. 7a corresponds to the number of pixels identified as a topological skeleton in the identified damaged regions, while the crack width pertains to the mode value of the crack width distribution, which has been identified using the distance transform. These results indicate a general underestimation of the crack length, which mainly depends on the presence of FNs in the outcome of the CNN. On the other hand, crack width is typically accurately estimated. Specifically, the average error (in absolute value) on length identification in the 20 inspection images is 31%, while the average width error is 38% (that, however, represents an average error of 2 pixels).

Fig. 7 Identified crack length (a) and width (b)

The overall prediction accuracy of Step 1 is illustrated in Fig. 8 using the receiver operating characteristic (ROC) curve that has an area under the curve (AUC) equal to 0.804. For an in-depth analysis of the crack width, the occurrence distributions obtained for every inspection image are reported in Fig. 9, normalized to have a total occurrence value equal to 1. The inspection number is written in the upper-right corner of each plot, in the form I#. In general, the identification results show higher probability values for higher widths when compared against the reference values. This is attributed to noise sources that are incorrectly identified as crack areas during Step 2, as well as to the fact that reference crack pixels are user defined and, thus, may suffer from subjectivity and imprecision in the definition of the crack boundaries. Nevertheless, the mode of the identified occurrence distributions is generally close to that of the reference distributions.

Fig. 8 Receiver operating characteristic curve

Fig. 9 Crack width distributions

In certain cases, a discrepancy is noted between the numerical results shown in Figs. 7 and 9 and the reference values. However, a visual analysis of the identified crack pixels shows that the identification outcome still coincides with actual cracked areas. Figure 10 contains the results obtained for four cases, namely inspections 10, 13, 16, and 18. Inspection 10 has an underestimated crack width (see Fig. 9), inspection 13 results in an underestimated crack length, inspection 16 has a considerably overestimated crack width, while inspection 18 has a slightly overestimated crack length and underestimated crack width using the proposed method.

Fig. 10 Crack detection using images from the testing dataset. From the left: original image, identified damaged regions (yellow), and identified crack pixels (red) compared with labels (white): Inspection 10 (a), 13 (b), 16 (c), and 18 (d)

The original inspection images (left) contain RGB color information, which is removed prior to application of the proposed method. The yellow areas in the central part of Fig. 10 indicate the identified damaged regions after the cleaning process. Some FPs remain, especially in Fig. 10b and d. In the latter case (inspection 18), FPs also lead to an overestimation of the crack length. On the other hand, FNs can be found in Fig. 10a and c. However, in these examples, it is possible to notice that marker handwriting of different thicknesses and colors does not affect the efficacy of the crack detection algorithm.

The right-hand side of Fig. 10 shows identified crack pixels superimposed on the manually selected labels. It is possible to observe that cracked areas are generally well identified, while the identification performance seems independent of camera distances, angles, exposure, and lighting conditions. Indeed, although inspections 16 and 18 have a similar camera angle, the performances in crack size identification differ, as mentioned before. Also, similar results in terms of crack length are obtained for inspections 10 and 13, which are taken under different lighting and exposure conditions.

Figure 11 displays a small detail of Fig. 10a, highlighted as a yellow rectangle in the figure of the identified crack pixels (right). In this figure, the identified crack pixels are visually in good agreement with the real crack. Moreover, several bumps and ramifications in the original crack image are well identified (see the area highlighted with a dashed contour in Fig. 11). In contrast, the reference label tends to regularize the crack profile and smooth its width. It should be recalled that the labels are obtained by manually (and roughly) selecting the crack areas and, therefore, do not necessarily provide a faithful representation of the crack. Nevertheless, manually selected labels are used in this study to obtain rough estimates of the crack size. The simplification of the reference labels is found to be the main source of discrepancy for the results reported in Figs. 8b and 9.

Fig. 11

Detail of the crack pixel detection and characterization process, compared to a reference label

Using the Nikon D7000 camera employed in this work at a distance of 1–2 m and a focal length of 80–100 mm, the image resolution is approximately 0.05 mm/pixel, which sets the minimum detectable width [26]. This limitation dictates a lower bound on the identifiable crack size. Indeed, cracks thinner than 0.05 mm would either be identified as cracks with zero-pixel width or not be identified at all. This phenomenon can be observed in the detail highlighted with a dashed contour in Fig. 11. The camera used in this study is a consumer-grade model and is comparable to the cameras mounted on most inspection drones reported in the literature [31].
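The order of magnitude of this resolution can be checked with a simple pinhole-camera estimate, sketched below. The sensor width (23.6 mm) and horizontal pixel count (4928) are assumptions taken from commonly quoted specifications of this camera model and are not stated in the source; the function names are hypothetical, and the approximation holds only when the distance is much larger than the focal length.

```python
def object_pixel_size_mm(pixel_pitch_mm, distance_mm, focal_length_mm):
    """Approximate size on the object covered by one pixel
    (pinhole model, valid for distance >> focal length)."""
    return pixel_pitch_mm * distance_mm / focal_length_mm

def width_in_pixels(crack_width_mm, mm_per_pixel):
    """Number of whole pixels spanned by a crack of the given width."""
    return int(crack_width_mm // mm_per_pixel)

# Assumed sensor data (not from the source): 23.6 mm wide, 4928 pixels across.
PIXEL_PITCH_MM = 23.6 / 4928

# Most favorable setup in the stated ranges: 1 m distance, 100 mm focal length.
best = object_pixel_size_mm(PIXEL_PITCH_MM, 1000.0, 100.0)
# Least favorable setup: 2 m distance, 80 mm focal length.
worst = object_pixel_size_mm(PIXEL_PITCH_MM, 2000.0, 80.0)

print(f"best case:  {best:.3f} mm/pixel")
print(f"worst case: {worst:.3f} mm/pixel")
# A crack thinner than one pixel spans zero whole pixels.
print(width_in_pixels(0.03, best))
```

Under these assumptions, the most favorable setup yields a value close to the 0.05 mm/pixel quoted above, while larger distances or shorter focal lengths give a coarser resolution.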

As a further comparison, the results obtained using only the Sobel edge detection method (applied with the same parameters as in Step 2 of the proposed procedure, see Sect. 2.3) are reported in Fig. 12 for the four selected inspection images. Although this method may perform well with modest noise sources, in this case shadows, surface textures, and handwritten script are detected as possible crack areas, highlighting the need for the proposed Step 1.

Fig. 12

Results obtained using the Sobel edge detection method: Inspection 10 (a), 13 (b), 16 (c), and 18 (d)
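The sensitivity of a stand-alone Sobel detector to non-crack features can be illustrated with a minimal sketch. It is a generic gradient-magnitude implementation on a synthetic image, not the implementation used in the study; the threshold value and the toy image are hypothetical and do not correspond to the parameters of Step 2.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale image using 3x3 Sobel kernels."""
    gray = np.asarray(gray, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(gray, 1, mode="edge")  # replicate border pixels
    rows, cols = gray.shape
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    # Accumulate the kernel response over the nine shifted views of the image.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + rows, j:j + cols]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def sobel_edges(gray, threshold):
    """Binary edge map: pixels whose gradient magnitude exceeds the threshold."""
    return sobel_magnitude(gray) > threshold

# Synthetic image (hypothetical): bright background, a thin dark vertical
# line standing in for a crack, and a dark block standing in for a marker stroke.
img = np.full((8, 8), 200.0)
img[:, 2] = 50.0
img[5:8, 5:8] = 60.0

edges = sobel_edges(img, threshold=100.0)  # threshold chosen for illustration only
print(edges.astype(int))
```

In this toy image, the borders of both the thin line and the dark block are flagged, since the detector responds to any sharp intensity gradient regardless of its origin.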

Further unlabeled inspection images containing external objects have been chosen to test the robustness of the procedure with respect to unseen disturbances. Figure 13 shows two examples from this dataset that include a red ruler, which was not present in any photo of the training dataset. The identification results are insensitive to this new disturbance source, and the identified crack pixels are in good agreement with the actual crack. Although the numbers and the measurement marks on the ruler have a width comparable with that of the cracks and present color gradients similar to those of cracks, no FPs are detected.

Fig. 13

Crack detection using unlabeled images with external objects; from left: original image and identified crack: Inspection 21 (a) and 22 (b)

4 Conclusions

In this paper, a method based on convolutional neural networks and edge detection is proposed to identify fatigue cracks in images acquired from welded connection details of steel structures. The method consists of two steps: first, the classification of small regions in the inspection image using a convolutional neural network, and second, the identification of crack pixels in the identified damaged areas. To increase the performance in terms of accuracy, a correction procedure based on morphological operations is implemented. At the end of Step 1, an AUC equal to 0.804 is achieved for the testing dataset. The crack characteristics (length and width) identified in the second step are generally in good agreement with the reference values, identified using manually defined crack labels. Although the identified crack pixels and labels present some discrepancies, mainly due to the subjective labeling in the reference images, the results of the proposed identification method are promising and have been shown to be robust to different noise sources, as well as to the presence of external objects (i.e., a red measurement ruler) that were not part of the training set.

One of the main advantages of the proposed method over existing techniques is the simple structure of the neural network. Moreover, the classification based on small regions considerably simplifies the generation of the training dataset. In the technique presented in this paper, it is not strictly necessary to precisely label crack pixels in the inspection image; a rough selection of damaged regions is sufficient. The proposed two-step procedure is a valuable tool that may help operators to limit the subjectivity of their evaluations and automate the crack detection process. In addition, the simple structure of the neural network reduces the required computational power and may thus allow for computer-aided in situ inspection schemes.