1 Introduction

Metal constructions are widely used in transportation infrastructure, including bridges, highways and tunnels. Rust and corrosion may cause severe safety problems. Hence, metal defect detection is a major challenge in civil engineering: it enables quick, effective and safe inspection, assessment and maintenance of the infrastructure [9], and helps deal with material deterioration caused by several factors, such as climate change, weather events and ageing.

Current image-analysis approaches detect defects by placing bounding boxes around defected areas, helping engineers rapidly focus on damages [2, 13, 14, 18]. Such approaches, however, are not adequate for structural analysis, since several metrics (e.g. area, aspect ratio, maximum distance) are required to assess the defect status. Thus, a more precise pixel-level classification is needed, which can also support novel construction ideas such as prefabrication [5]. Prefabrication allows components to be built outside the infrastructure, decreasing maintenance cost and time, improving traffic flow and reducing working risks. Additionally, real-time classification is necessary to achieve fast inspection of critical infrastructure, especially on large-scale structures. Finally, only a small number of training samples is available, since data collection requires specific traffic arrangements, specialized equipment and extra manpower, which increases the cost dramatically.

1.1 Related Work

Recently, deep learning algorithms [20] have been proposed for defect detection. Since the data are received as 2D images, convolutional neural networks (CNNs) have been applied to identify regions of interest [14]. Other approaches exploit the CNN structure to detect cracks in concrete and steel infrastructures [2, 6], road damage [15, 23] and railroad defects on metal surfaces [18]. Finally, the work of [3] combines a CNN with a Naïve Bayes data fusion scheme to detect crack patches in nuclear power plants.

The main problem of all the above-mentioned approaches is that they employ conventional deep models, such as CNNs, which require a large number of annotated samples [11, 12]. In our case, building such a collection is an arduous task, since the annotation must be carried out at pixel level by experts. For this reason, most existing methods estimate the defected regions through bounding boxes. In addition, the computational complexity of the above methods is high, a crucial factor when inspecting large-scale infrastructures.

To address these constraints, we exploit alternative deep learning approaches proposed for semantic segmentation, albeit in applications other than defect detection in transportation networks, such as Fully Convolutional Networks (FCN) [10], U-Nets [16] and Mask R-CNN [7]. The efficiency of these methods has already been verified in medical imaging (e.g. brain tumour and COVID-19 symptom detection) [4, 21, 22].

Fig. 1. An overview of the proposed methodology flowchart.

1.2 Paper Contribution

In this paper, we apply three semantic segmentation-oriented deep models (FCN, U-Net and Mask R-CNN) to detect corrosion in metal structures, since they perform more efficiently than traditional deep models. However, the derived masks are still inadequate for structural analysis and prefabrication, because salient parts of a defected region, especially at the contours, are misclassified. Thus, a detailed, pixel-based mask should be extracted so that civil engineers can take precise measurements on it.

To overcome this problem, we combine, through projection, the results of color segmentation, which yields accurate contours but over-segments regions, with a processed area of the deep masks (obtained through morphological operators), which indicates only the highly confident pixels of a defected area. The projection merges color segments belonging to a damaged area, improving pixel-based classification accuracy. Experimental results on real-life corroded images, captured in the European H2020 PANOPTIS project, show that the proposed scheme outperforms both segmentation-oriented deep networks and traditional deep models.

2 The Proposed Overall Architecture

Let \(I \in \mathbb {R}^{w \times h \times 3}\) be an RGB image of size \(w \times h\). Our problem is a traditional binary classification: areas with intense corrosion (rust grade categories B, C and D) versus areas of no or minor corrosion (category A). The rust grade categories stem from the ISO 8501-1 standard of civil engineering and are described in Sect. 5.1 and depicted in Fig. 3.

Figure 1 depicts the overall architecture of our approach. The RGB images are fed as inputs to the FCN, U-Net and Mask R-CNN deep models to carry out the semantic segmentation. Despite the effectiveness of these networks, inaccuracies still appear on the contours of the detected objects. Although these errors are small when measured as a percentage of the total corroded region, they are very important for structural analysis and prefabrication.

To increase the pixel-level accuracy of the derived masks, we combine color segmentation with the regions produced by the deep models. Color segmentation precisely localizes the contours of an object, but over-segments it into multiple color areas. Conversely, the masks of the deep networks correctly localize the defects, but fail to accurately segment the boundaries. Therefore, we shrink the masks of the deep models to find the most confident regions, i.e., pixels indicating a defect with high probability. This is done through a morphological erosion operator applied on the initial detections. We also morphologically dilate the deep model regions to localize vague areas for which we still need to decide to which region they belong. On that extended mask, we apply watershed segmentation to generate color segments. Finally, we project the results of the color segmentation onto the highly confident regions to merge different color clusters of the same corroded area.

3 Deep Semantic Segmentation Models

Three types of deep networks are applied to obtain the semantic segments. The first is a Fully Convolutional Network (FCN) [10], which has no fully-connected layers, reducing the loss of spatial information and allowing faster computation. The second is a U-Net, originally built for medical image segmentation [16]. Its architecture is heavily based on the FCN, with two key differences: U-Net (i) is symmetrical, having multiple upsampling layers, and (ii) uses skip connections that apply a concatenation operator instead of an element-wise addition. Finally, the third model is the Mask R-CNN [7], which extends Faster R-CNN with an FCN branch. This model defines bounding boxes around the corroded areas and then segments the rust inside the predicted boxes.
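
To make difference (ii) concrete, the following Keras sketch contrasts the two skip-connection styles; the tensor shapes and filter counts are illustrative assumptions and do not correspond to the exact configurations used here.

```python
from tensorflow.keras import layers

# Illustrative feature maps (shapes are assumed for the example only)
encoder_feat = layers.Input(shape=(64, 64, 128))   # encoder output at some scale
decoder_feat = layers.Input(shape=(32, 32, 128))   # deeper decoder output

# Upsample the decoder features to the encoder resolution
up = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(decoder_feat)

# U-Net style: channel-wise concatenation of encoder and decoder features
skip_concat = layers.Concatenate()([up, encoder_feat])   # shape (64, 64, 256)

# FCN style: element-wise addition of the two feature maps
skip_add = layers.Add()([up, encoder_feat])              # shape (64, 64, 128)
```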

To detect the defects, the models receive RGB data as input and generate binary masks as output, providing pixel-level corrosion detection. However, the models fail to generate high-fidelity annotations at the boundary level; the contours of the detected regions fail to fully encapsulate the rusted parts of the object. As such, a region-growing approach over these low-confidence boundary regions is applied to improve the robustness of the outcome and provide refined masks. For training the models, we use an annotated dataset which has been built by civil engineers under the framework of the EU project H2020 PANOPTIS.

4 Refined Detection by Projection - Fusion with Color Segmentation

The inaccuracies in the contours of the outputs of the aforementioned deep models are due to the multiple down/up-scaling operations within their convolutional layers. To refine the initially detected masks, the following steps are adopted: (i) localizing a region of pixels that belong to a defect with high confidence, as a subset of the deep output masks; (ii) localizing fuzzy regions, for which we cannot confidently decide whether they belong to a corroded area or not, through an extension of the deep output masks; (iii) applying a color segmentation algorithm on the extended masks, which contain both the fuzzy and the high-confidence regions; and (iv) projecting the results of the color segmentation onto the high-confidence area. The projection retains the contour accuracy stemming from the color segmentation while merging different color segments of the same defect.
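
As a rough roadmap (not the authors' released code), steps (i)-(iv) can be chained as in the following Python sketch; `extract_regions`, `split_region` and `refine_region` are hypothetical helpers that are sketched further down in this section.

```python
import numpy as np

def refine_corrosion_mask(image_rgb, deep_mask):
    """Chain steps (i)-(iv) over every detected region of a binary deep-model mask."""
    refined = np.zeros(deep_mask.shape, dtype=bool)
    for R_j in extract_regions(deep_mask):                   # regions R_1, ..., R_N
        R_T, R_Fplus = split_region(R_j)                     # steps (i)-(ii)
        refined |= refine_region(image_rgb, R_T, R_Fplus)    # steps (iii)-(iv)
    return refined
```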

Fig. 2. Illustration of the proposed projection method. (a) Detected regions of interest \(R_j\) (white). (b), (c) High-confidence regions \(R_j^T\) (green) and fuzzy regions \(R_j^{F+}\) (yellow). (c) Color segments \(R_j^{w}\) (red) and the final refined contour \(R_j^{(a)}\) (blue). (Color figure online)

Let us assume that, in an RGB image \(I \in \mathbb {R}^{w \times h \times 3}\), a deep learning model has generated a set of N corroded regions \(R=\{R_1,...,R_N\}\). Each region is a set of pixels \(R_j=\{(x_i,y_i)\}_{i=1}^{m_j}\), \(j=1,...,N\), where \(m_j\) is the number of pixels in the set. The remaining pixels represent the no-detection or background area, denoted by \(R^B\), so that \(R \cup R^B=I\) (see Fig. 2a). Subsequently, for each \(R_j\), \(j=1,...,N\), we consider two subsets \(R_j^T\) and \(R_j^F\), such that \(R_j^T \cup R_j^F=R_j\). The first set, \(R_j^T\), corresponds to inner pixels of the region \(R_j\), which are considered true foreground and indicate high-confidence corroded areas; it is obtained by applying a morphological erosion operator on \(R_j\). The second set, \(R_j^F\), contains the remaining pixels \((x_i,y_i) \in R_j\), whose status is considered fuzzy.
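
For concreteness, one way to obtain the regions \(R_j\) from a binary deep-model mask is connected-component labelling, e.g. with OpenCV; this is only a sketch of an assumed preprocessing step, not the paper's exact implementation.

```python
import cv2
import numpy as np

def extract_regions(deep_mask):
    """Split a binary deep-model mask (1 = corrosion, 0 = background R^B)
    into its connected corroded regions R_1, ..., R_N."""
    num_labels, labels = cv2.connectedComponents(deep_mask.astype(np.uint8))
    # Label 0 is the background R^B; the remaining labels give the regions R_j
    return [(labels == j).astype(np.uint8) for j in range(1, num_labels)]
```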

We now define a new region \(R_j^{F+} \supset R_j^{F}\), a slightly extended version of \(R_j^{F}\), obtained using the morphological dilation operator. The implementation is carried out so that \(R_j^T\) and \(R_j^{F+}\) are adjacent but non-overlapping, that is, \(R_j^T \cap R_j^{F+}=\varnothing \). Summarizing, we have three sets of areas (see Fig. 2b): (i) true foreground or corroded areas \(R^T=\{R_1^T,...,R_N^T\}\) (green region in Fig. 2b), (ii) fuzzy areas \(R^{F+}=\{R_1^{F+},...,R_N^{F+}\}\) (yellow region in Fig. 2b) and (iii) the remaining image area \(R^B\), denoting the background or no-detection areas (black region in Fig. 2b).
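
A minimal sketch of this split with OpenCV morphology follows; the structuring-element shape and size are assumptions that would have to be tuned to the image resolution.

```python
import cv2

def split_region(R_j, kernel_size=15):
    """Split one region R_j into the high-confidence core R_j^T (erosion)
    and the adjacent, non-overlapping fuzzy band R_j^{F+} (dilation minus core)."""
    # Elliptical structuring element; the size is an assumed, tunable parameter
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    R_T = cv2.erode(R_j, kernel)        # true foreground: inner, highly confident pixels
    R_ext = cv2.dilate(R_j, kernel)     # slightly extended version of R_j
    R_Fplus = cv2.subtract(R_ext, R_T)  # fuzzy band, disjoint from R_j^T by construction
    return R_T, R_Fplus
```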

In the extended fuzzy region \(R^{F+}\), we apply the watershed color segmentation algorithm [1]. Let us assume that the color segmentation produces \(M_j\) segments \(s_{i,j}\), \(i=1,...,M_j\), for the j-th defected region (see the red region in Fig. 2c), which together form the set \(R_j^w=\bigcup \limits _{i=1}^{M_j}s_{i,j}\). Then, we project the segments \(R_j^w\) onto \(R_j^T\) so that:

$$\begin{aligned} R_j^{(a)}=\{s_{i,j}\in R_j^w: s_{i,j}\cap R_j^T \ne \varnothing \}, j=1,...,N \end{aligned}$$
(1)

Ultimately, the final detected region (see the blue region in Fig. 2c) is defined as the union of all sets \(R_j^T\) and \(R_j^{(a)}\) over all corroded regions \(j=1,...,N\).
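
One possible realization of the segmentation and projection steps is sketched below, using scikit-image's gradient-based watershed as a stand-in for the watershed color segmentation of [1]; the elevation map and all parameter choices are our own assumptions.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import sobel
from skimage.segmentation import watershed

def refine_region(image_rgb, R_T, R_Fplus):
    """Watershed segmentation over R_j^T ∪ R_j^{F+}, followed by the projection
    of Eq. (1): keep every segment that intersects the confident core R_j^T,
    and return its union with R_j^T."""
    core = R_T.astype(bool)
    extended = core | R_Fplus.astype(bool)

    # Gradient magnitude as the elevation map; flooding is restricted to the
    # extended region, so pixels outside it are left unlabelled.
    elevation = sobel(rgb2gray(image_rgb))
    segments = watershed(elevation, mask=extended)

    refined = core.copy()
    for label in np.unique(segments):
        if label == 0:
            continue
        segment = (segments == label) & extended
        if np.any(segment & core):      # Eq. (1): s_{i,j} ∩ R_j^T ≠ ∅
            refined |= segment
    return refined
```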

5 Experimental Evaluation

5.1 Dataset Description

The dataset was obtained from heterogeneous sources (DSLR cameras, UAVs, cellphones) and contains 116 images of various resolutions, ranging from 194 \(\times \) 259 to 4248 \(\times \) 2852. All data were collected under the framework of the H2020 PANOPTIS project. Of the dataset, 80% is used for training and validation, while the remaining 20% is kept for testing. Within the training data, 75% is used for training and the remaining 25% for validation. The images vary in terms of corrosion type, illumination conditions (e.g. overexposure, underexposure) and environmental landscape (e.g. highways, rivers, structures). Furthermore, some images contain various types of occlusion, making detection more difficult.
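
Under this split, the 116 images correspond to roughly 23 test images and 93 images for training and validation, of which about 70 are used for training and 23 for validation.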

Fig. 3. Representative image examples of rust grades.

All images of the dataset were manually annotated by engineers within the PANOPTIS project. In particular, corrosion has been classified according to the ISO 8501-1 standard (see Fig. 3): (i) Type A: steel surface largely covered with adhering mill scale but little, if any, rust. (ii) Type B: steel surface which has begun to rust and from which the mill scale has begun to flake. (iii) Type C: steel surface on which the mill scale has rusted away or from which it can be scraped, but with slight pitting visible under normal vision. (iv) Type D: steel surface on which the mill scale has rusted away and on which general pitting is visible under normal vision.

5.2 Models Setup

A common practice is to use pretrained networks of a specified topology; transfer learning then serves as a starting point, allowing fast initialization with minimal topological interventions. The FCN-8s variant [10, 17] served as the main detector for the FCN model. The Mask R-CNN detector was based on Inception V2 [19], pretrained on the COCO dataset [8]. In contrast, the U-Net was designed from scratch. The contracting part of the adopted variant had the following setup: Input \(\rightarrow \) 2@Conv \(\rightarrow \) Pool \(\rightarrow \) 2@Conv \(\rightarrow \) Pool \(\rightarrow \) 2@Conv \(\rightarrow \) Drop \(\rightarrow \) Pool, where 2@Conv denotes two consecutive convolutions of size \(3 \times 3\). Finally, the decoder used three corresponding upsampling layers.
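
A minimal Keras sketch of such a U-Net variant is given below; the filter counts, input resolution, dropout rate and activations are our own assumptions, and only the block structure (three 2@Conv blocks with dropout before the last pooling, followed by three upsampling stages) follows the description above.

```python
from tensorflow.keras import layers, models

def build_unet(input_shape=(256, 256, 3)):            # input size is an assumption
    """Sketch of the described variant: Input -> 2@Conv -> Pool -> 2@Conv ->
    Pool -> 2@Conv -> Drop -> Pool, followed by three upsampling stages."""
    inp = layers.Input(shape=input_shape)

    # Contracting path (filter counts are illustrative)
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(c1)
    p1 = layers.MaxPooling2D()(c1)

    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(c2)
    p2 = layers.MaxPooling2D()(c2)

    c3 = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)
    c3 = layers.Conv2D(128, 3, activation="relu", padding="same")(c3)
    d3 = layers.Dropout(0.5)(c3)
    p3 = layers.MaxPooling2D()(d3)

    # Decoder: three upsampling stages with concatenation skip connections
    u1 = layers.Concatenate()([layers.Conv2DTranspose(128, 2, strides=2, padding="same")(p3), c3])
    u2 = layers.Concatenate()([layers.Conv2DTranspose(64, 2, strides=2, padding="same")(u1), c2])
    u3 = layers.Concatenate()([layers.Conv2DTranspose(32, 2, strides=2, padding="same")(u2), c1])

    # One-channel sigmoid output: binary corrosion mask
    out = layers.Conv2D(1, 1, activation="sigmoid")(u3)
    return models.Model(inp, out)
```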

Fig. 4. Visual comparison of the deep models' outputs, without applying boundary refinement.

5.3 Experimental Results

Figure 4 shows two corroded images together with the respective ground-truth annotations and the outputs of the three deep models, without boundary refinement. In general, the results illustrate the semantic segmentation capabilities of the core models; nevertheless, the boundary regions are rather coarse. To refine these areas, we apply the color projection methodology described in Sect. 4 and visualized in Fig. 5, which shows the defected regions before and after boundary refinement (see the last two columns of Fig. 5). The corroded regions are shown in green for clarity.

Fig. 5. Boundary refinement outputs for specific areas.

Objective results are reported in Fig. 6 using precision and F1-score. The proposed fusion method improves precision and, to a lesser extent, the F1-score. U-Net performs slightly better than the other models in terms of precision, while all deep models yield almost the same F1-score. Regarding time complexity, U-Net is the fastest approach, followed by FCN (1.79 times slower), with Mask R-CNN last (18.25 times slower). Thus, even though Mask R-CNN followed by boundary refinement achieves the best F1-score, it may not be suitable as the main detection mechanism due to its high execution time.

Fig. 6. Comparative performance metrics for the different deep models, before and after the boundary refinement.

6 Conclusions

In this paper, we propose a novel data projection scheme for accurate pixel-based detection of corroded regions on metal constructions. This projection/fusion exploits the results of deep models, which correctly identify the semantic area of a defect but fail on the boundaries, and of a color segmentation algorithm, which over-segments the defect into multiple color areas but retains contour accuracy. The deep models used were FCN, U-Net and Mask R-CNN. Experimental results and comparisons on real data, annotated by expert engineers, verify that the proposed scheme outperforms the baselines, even for very challenging image content with multiple types of defects. Although the increase in accuracy is relatively small, the refined defected areas can significantly improve structural analysis and prefabrication compared to traditional methods.