1 Introduction

Civil infrastructure provides essential services to citizens through its operation and management activities. If maintenance is not carried out in time, it not only creates potential hazards for the infrastructure and its ancillary facilities but also threatens citizens' lives. It is therefore essential to monitor the condition of infrastructure in real time, so that necessary repairs and maintenance can be carried out proactively, before problems become dangerous and expensive. Conventional manual monitoring is time-consuming, laborious and expensive, and it raises health and safety concerns, particularly in aerial working environments where inspection is difficult to conduct [1].

In recent years, deep learning techniques, especially convolutional neural networks (CNNs), have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases [2]. Computer vision is changing construction management processes because it enables the automatic acquisition, processing and analysis of digital images, and the extraction of high-dimensional data from the real world to produce information that improves managerial decision-making [3].

Deep learning has achieved promising performance in various computer vision tasks, such as image classification [4], object detection [5] and object segmentation [6]. These three tasks are closely related and progressively more difficult: all three are built on CNNs, and both object detection and segmentation reuse basic network models from image classification. CNN-based image classification algorithms have provided many new ideas for object detection and segmentation and have achieved good results. This paper briefly describes these three tasks and compares them in general terms. First applied in industry, these tasks have since been extended to many other fields with considerable success, and they have great application prospects in civil infrastructure maintenance. Applying CNN-based deep learning to the automatic detection and localization of defects in civil infrastructure [7], such as bridges [8], roads [9] and sewage pipes [6], can help solve the problems described above.

The remainder of this paper is organized as follows, and Figure 1 shows the structure of this review. In Sect. 2, the research progress of deep learning, including the structure of deep learning models, is described. Section 3 describes how deep learning methods address key computer vision tasks: image classification, object detection and object segmentation. In Sect. 4, applications of deep learning-based computer vision in civil infrastructure maintenance are reviewed. Section 5 concludes the paper.

Fig. 1 Review structure

2 Research Progress of Deep Learning

2.1 Important Milestones of Deep Learning

Deep learning methods usually address rich and complex data from different sources, and they have outperformed previous technologies in multiple tasks, attracting increasing attention. Where did deep learning start? How can one determine whether a particular deep learning model is suitable for a given problem? How are such models trained and deployed? With these questions in mind, Table 1 first summarizes the important milestones leading up to the era of deep learning [2]. The McCulloch-Pitts (MCP) model and the Neocognitron mark the beginnings of artificial neural networks (ANNs) and CNNs, respectively. Later, AlexNet [10] won the 2012 ImageNet contest with a top-5 error rate 10.9 percentage points lower than that of the second-place entry. Since then, deep learning and convolutional neural networks have risen to prominence. An overview of the CNN-based deep learning structure is presented next.

Table 1 Important milestones

2.2 Deep Learning Structure Based on CNN

CNNs were inspired by the structure of the visual system, in particular by the models of it proposed in [11]. A CNN consists of three main types of layers, namely convolutional layers, pooling layers and fully connected layers, each of which plays a different role. Figure 2 shows a general CNN architecture for an image classification task. In addition, CNNs typically use activation functions, batch normalization and regularization.

Fig. 2 The general CNN architecture

  1. (i)

    Convolutional layers

    In the convolutional layers, various kernels are convolved with the input data to generate feature maps. The convolution operation slides the kernel across the entire image with a given step size (stride); at each position, the kernel weights are multiplied by the pixel values at the corresponding positions of the image and the products are summed. The resulting sum is the value of the corresponding pixel in the output feature map.

  2. (ii)

    Pooling layers

    The pooling layer reduces the spatial size (width × height) of the input to the next convolutional layer through max pooling or average pooling, but does not affect its depth. This reduces the number of parameters in the network and the consumption of computing resources, and it also helps control overfitting. A pooling layer slides a spatial window over the input data, outputs the maximum or average value within the window at each position, and continues sliding until the entire input has been covered; the outputs of all window positions, arranged in order, form the final output, which is spatially smaller than the input. Because the window size and stride determine the output, they must be chosen appropriately to preserve the accuracy of the results.

  3. (iii)

    Fully connected layers

    Following several convolutional and pooling layers, the high-level reasoning in the neural network is performed via fully connected layers. Fully connected layers play the role of classifier in the entire CNN.

  4. (iv)

    Activation function

    The Rectified Linear Unit (ReLU), f(x) = max(0, x), alleviates the vanishing-gradient problem to which sigmoid and tanh are prone, and it is currently the most commonly used activation function. An activation function is generally applied after each convolution.

  5. (v)

    Batch Normalization and Regularization

    Batch Normalization normalizes the inputs of a layer over each mini-batch to a mean of 0 and a variance of 1, and then applies a learnable scale and shift; this stabilizes training and helps mitigate vanishing gradients. Dropout is a simple but powerful regularization method that randomly drops some nodes in each training iteration and trains only the remaining nodes, which suppresses overfitting.
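To make the layer operations listed above more concrete, the following is a minimal pure-Python sketch, not the implementation used by any particular framework. It assumes single-channel feature maps stored as nested lists, a valid (unpadded) convolution, and scalar mini-batches for batch normalization; the function names (conv2d, relu, pool2d, batch_norm, dropout) and the example kernel are hypothetical. For a window of size k moved with stride s over an input of width W without padding, the output width is floor((W - k) / s) + 1, which the code uses to size each output.

```python
import math
import random


def conv2d(image, kernel, stride=1):
    """Valid (unpadded) 2-D convolution of a single-channel image.

    As in most deep learning frameworks, this is technically a
    cross-correlation: the kernel is slid over the image without flipping.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply kernel weights by the covered pixels and sum the products.
            total = 0.0
            for u in range(kh):
                for v in range(kw):
                    total += kernel[u][v] * image[i * stride + u][j * stride + v]
            row.append(total)
        out.append(row)
    return out


def relu(fmap):
    """Apply ReLU, max(0, x), element-wise to a feature map."""
    return [[max(0.0, x) for x in row] for row in fmap]


def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride, keeping the max or mean."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [fmap[i * stride + u][j * stride + v]
                      for u in range(size) for v in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one mini-batch to zero mean and unit variance, then scale by
    gamma and shift by beta (both are learned during training in a real CNN)."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in values]


def dropout(values, p=0.5, rng=random):
    """Inverted dropout at training time: zero each value with probability p and
    rescale the survivors by 1 / (1 - p) so the expected activation is unchanged."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in values]


# Usage: a 6 x 6 intensity ramp and a hypothetical horizontal-gradient kernel.
image = [[float(6 * r + c) for c in range(6)] for r in range(6)]
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]
fmap = relu(conv2d(image, kernel))   # (6 - 3) // 1 + 1 = 4, so 4 x 4
pooled = pool2d(fmap)                # (4 - 2) // 2 + 1 = 2, so 2 x 2
print(pooled)                        # [[6.0, 6.0], [6.0, 6.0]]
```

Because the ramp increases by a constant amount from column to column, every kernel position yields the same response; the example mainly shows how window size and stride shrink the spatial size (6 × 6 to 4 × 4 to 2 × 2) while the depth (one channel) is unchanged.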

2.3 The Relationship Between Machine Learning, Deep Learning, CNN, Computer Vision and Civil Infrastructure

Understanding the relationship between machine learning, deep learning, CNNs, computer vision and civil infrastructure helps frame this paper. Machine learning solves a problem by learning a mapping from inputs X to outputs Y with a model; available models include logistic regression, linear regression, the support vector machine (SVM) and others. Machine learning that uses neural network models is called deep learning, and these models include convolutional neural networks. Applied to computer vision, CNNs address three major tasks: image classification, object detection and object segmentation. These three tasks are in turn applied to civil infrastructure, as shown in Fig. 3.

Fig. 3 The relationship between machine learning, deep learning, CNN, computer vision and civil infrastructure

3 Application of CNN in Computer Vision

Deep learning has been widely adopted in various areas of computer vision, such as image classification, object detection and segmentation, which are key tasks for image understanding. Figure 4 illustrates the differences among the three tasks, taking crack images of sewage pipes [12] as an example. This section briefly summarizes the developments of deep learning in these three tasks, especially CNN-based algorithms.

Fig. 4 Example of crack images of sewage pipes in three tasks

3.1 Image Classification

In image classification, an image is labeled with the probability that a particular visual object class is present [13]. It is the simplest and most basic image understanding task, and it is the task in which deep learning models first achieved a breakthrough and large-scale application.

In general, CNNs outperform classical algorithms [14]. Through continuous research and improvement of the CNN structure, a series of network models have been developed and successfully used in a wide range of practical applications, such as AlexNet, VGGNet [15], GoogLeNet [16] and ResNet [17], as shown in Table 2. The table shows that more and more optimizations, such as Dropout, local response normalization (LRN) and batch normalization, have been incorporated into network design. Fig. 5 presents the state-of-the-art top-5 error rates on ImageNet since 2012. CNN-based models have also been applied to cracks in civil infrastructure. For example, Zhou and Song developed deep CNN structures with different layouts for crack classification based on laser-scanned range images [4], and Wang et al. proposed a CNN-based damage classification technique targeting historic masonry structures [18].
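The probabilistic labeling described above and the top-5 error rate shown in Fig. 5 can be illustrated with a short sketch. It assumes a classifier that outputs one raw score (logit) per class; softmax converts these scores into probabilities, and, following the ImageNet convention, a prediction counts as correct for top-5 error if the true class is among the five highest-scoring classes. The function names, scores and labels below are hypothetical illustrative values, not results of any model in Table 2.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_error(score_lists, labels, k=5):
    """Fraction of images whose true label is not among the k highest scores."""
    misses = 0
    for scores, label in zip(score_lists, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for two images over seven classes.
scores = [
    [5.0, 1.0, 0.5, 0.2, 0.1, 0.0, -1.0],  # true class 0 is ranked first
    [0.1, 4.0, 3.0, 2.0, 1.0, 0.5, -2.0],  # true class 3 is ranked third
]
labels = [0, 3]
print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=5))  # 0.0
```

The second image is misclassified at rank 1 but counted as correct within the top five, which is why top-5 error is never higher than top-1 error.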

Table 2 Structure of typical convolutional neural network models

Fig. 5 The top-5 error rate results of typical models

3.2 Object Detection

Image classification is the basis of computer vision, but classification alone is not enough. Object detection is different from, but closely related to, image classification, and object detection and segmentation are more difficult but more informative. The classification task is concerned only with assigning classes, whereas the detection task must also determine the location of each detected object.

Object detection has been studied for many years, and many methods have been widely recognized and applied in industry. Table 3 summarizes several typical detection models and their features. Object detection networks are usually divided into two categories: one-stage networks, such as the You Only Look Once (YOLO) series [19,20,21] and the Single Shot MultiBox Detector (SSD) [22], and two-stage networks, such as the Regions with CNN features (R-CNN) series [23]. In general, one-stage networks are faster, while two-stage networks are more accurate. Researchers who applied both types of detection networks to sewage pipes reached the same conclusion [24]. These algorithms have also attracted wide research attention. For example, YOLO has been applied to the automatic detection of various kinds of defects [25] and to detecting multiple types of damage on the surfaces of concrete bridges [26, 27], and Faster R-CNN has been used to detect and preliminarily evaluate earthquake damage to buildings [28].
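Because detection must also output object locations, detectors of both families typically predict bounding boxes, compare boxes using intersection over union (IoU), and remove duplicate predictions with non-maximum suppression (NMS). The sketch below is a minimal, framework-independent illustration that assumes boxes given as (x1, y1, x2, y2) corner coordinates and an IoU threshold of 0.5; it is not the exact post-processing of YOLO, SSD or Faster R-CNN, which differ in details such as per-class suppression. The box coordinates and scores are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring box
    and discard remaining boxes that overlap it by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical crack detections: two overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(round(iou(boxes[0], boxes[1]), 3))  # 0.681
print(nms(boxes, scores))                 # [0, 2]
```

The second box overlaps the higher-scoring first box with an IoU of about 0.68, so it is treated as a duplicate and suppressed, while the distant third box is kept.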

Table 3 Object detection model features

3.3 Object Segmentation

Beyond classification and object detection, many applications require identifying all the pixels that belong to an object and assigning them a category. This more difficult task is called object segmentation.

Object segmentation consists of semantic segmentation and instance segmentation. Semantic segmentation is an extension of foreground/background segmentation and requires separating image regions with different semantics [29]. Figure 6 shows the scores of typical semantic segmentation models on the PASCAL VOC2012 dataset. Instance segmentation is an extension of the detection task: it requires the outline of each object, which is more refined than a bounding box. Mask R-CNN and FCIS [30] are among the most significant research outcomes of recent years. Compared with semantic segmentation, instance segmentation can label different individuals of the same object class in an image, making it a comprehensive task that combines image classification, object detection and semantic segmentation.

Fig. 6 The scores of typical semantic segmentation models

In general, object segmentation is a pixel-level description of an image [14], which makes it suitable for scenarios that demand detailed scene understanding, such as separating road from non-road regions in autonomous driving.
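Because segmentation is evaluated pixel by pixel, its scores differ from detection metrics. Benchmarks such as PASCAL VOC2012 commonly report the intersection over union (IoU) between predicted and ground-truth pixels for each class, averaged over classes (mean IoU). The sketch below computes pixel accuracy and mean IoU for two label maps; the function name, the labeling 0 = background and 1 = crack, and the tiny label maps are illustrative assumptions.

```python
def segmentation_scores(pred, truth, num_classes):
    """Return (pixel accuracy, mean IoU) for two equal-sized label maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    correct = total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(p == t)
            for c in range(num_classes):
                inter[c] += int(p == c and t == c)
                union[c] += int(p == c or t == c)
    # Average IoU only over classes present in the prediction or ground truth.
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return correct / total, sum(ious) / len(ious)


# 0 = background, 1 = crack (hypothetical 2 x 3 label maps).
truth = [[0, 1, 1], [0, 1, 0]]
pred = [[0, 1, 0], [0, 1, 0]]
acc, miou = segmentation_scores(pred, truth, num_classes=2)
print(round(acc, 3), round(miou, 3))  # 0.833 0.708
```

Missing a single crack pixel lowers the crack-class IoU to 2/3 even though pixel accuracy stays above 0.8; for thin structures such as cracks, which cover few pixels, this is one reason class-averaged IoU is more informative than pixel accuracy.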

3.4 Typical Experimental Tools and Model Evaluation

Good tools, such as datasets and computing platforms, make the research process more efficient and successful. The development of deep learning is inseparable from the development of datasets. Typical image processing datasets, including MNIST, PASCAL VOC, CIFAR, ImageNet, COCO, Open Images and YouTube-8M, play an important role in recent neural network research in industrial applications, academic research and other fields. Programming frameworks that support deep learning, such as TensorFlow, MXNet, PaddlePaddle, Caffe, Torch and Theano, are also very popular and provide rich, convenient interfaces for mathematical computation.

Deep learning is a branch of machine learning, and precision and recall are typical evaluation metrics for most machine learning tasks. However, because target objects are unevenly distributed, these conventional metrics are not suitable on their own for multi-object detection models. Therefore, different types of classification errors should be considered when evaluating object detection models [5]. Model performance is summarized in two aspects: (1) accuracy, measured by precision and recall, average precision (AP), mean AP (mAP) and miss rate; and (2) computational cost, measured by detection speed and training time.
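The accuracy metrics listed above can be computed as in the following sketch. It assumes that each detection has already been matched to the ground truth (for example, counted as a true positive when its IoU with a not-yet-matched ground-truth box is at least 0.5), and it uses all-point interpolation of the precision-recall curve, which is only one of several AP conventions; COCO-style AP, for instance, additionally averages over several IoU thresholds. The function names and toy detections are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def miss_rate(tp, fn):
    """Fraction of ground-truth objects that were not detected (1 - recall)."""
    return fn / (tp + fn) if tp + fn else 0.0


def average_precision(detections, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        tp += int(is_tp)
        fp += int(not is_tp)
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at this or any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum precision times the recall increment at each rank.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detections for one class with 3 ground-truth objects.
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(round(average_precision(dets, 3), 3))  # 0.556
print(precision_recall(tp=2, fp=2, fn=1))    # (0.5, 0.6666666666666666)
```

In this toy case, one of the three objects is never detected, so recall stops at 2/3 and the AP (5/9) stays well below 1 even though the highest-confidence detection is correct.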

4 Application in Civil Infrastructure

Civil infrastructures, including bridges, roads, tunnels and underground utilities such as sewage pipes, are increasingly susceptible to losing their designed functions because of deterioration caused by use [7]. This inevitable situation means that urgent maintenance is required. Condition monitoring of concrete surfaces plays a significant role in civil infrastructure management systems [31], and defects are the main threat to the concrete surfaces of infrastructure. Traditional vision-based crack detection methods lack accuracy and generalization under complicated infrastructural conditions [32]. At present, a number of computer vision-based crack detection techniques have been developed to improve the efficiency, speed and objectivity of inspection [33] and to manage a large number of structures [34]. For example, as shown in Fig. 7, three CNN-based computer vision tasks (image classification, object detection and object segmentation) are applied to three types of civil infrastructure: sewage pipes, bridges and roads.

Fig. 7 Three kinds of civil infrastructure using three kinds of computer vision tasks, where a, b and c denote image classification, object detection and object segmentation, and 1, 2 and 3 denote sewage pipe, bridge and road, respectively

4.1 Sewage Pipe Inspection

As an important component of civil infrastructure, sanitary sewer systems are designed to collect and transport sanitary wastewater and stormwater. Sewer defect inspection, which identifies both the type and the location of pipe defects, is key to maintaining normal sewer operation [5] and to maintaining urban underground infrastructure [35].

For sewage pipe defect inspection, a CNN was first used to detect and characterize cracks on an autonomous sewer inspection robot [36]. Currently, closed-circuit television (CCTV) and other visual inspection technologies are widely used to inspect underground sewage pipelines. However, manual interpretation of the resulting images or videos is time-consuming and subjective [37]. In contrast, deep learning-based approaches can automatically extract image features, improve accuracy and efficiency, and require little image preprocessing. Therefore, several studies have explored deep learning-based approaches. For example, Dirk Meijer applied image classification to sewage pipe inspection using a sufficiently large dataset of over 2 million CCTV images [38]. A deep learning-based approach for sewer pipe defect detection was developed using the faster region-based convolutional neural network (Faster R-CNN) [12]. With the further development of deep learning techniques, Yin et al. employed a state-of-the-art CNN-based object detector, YOLOv3, to build a defect detection system for sewer pipes [39]. A unified neural network, DilaSeg-CRF, which fully integrates a deep CNN with a dense conditional random field (CRF), was proposed and applied to sewer pipes [6].

4.2 Bridge Inspection

Bridges play an important role in civil infrastructure. Periodic inspections are very important for maintaining the functionality, safety and reliability of bridge structures, which makes continuous monitoring and maintenance essential. As bridges age, the number of bridges that need to be inspected increases, which requires substantial maintenance costs. If bridge maintenance is postponed, greater costs will be incurred in the near future [40].

Traditional bridge inspection relies on human visual inspection [41], which remains the most widely adopted of all nondestructive evaluation techniques used to identify and monitor defects [42]. This method has several limitations: its performance depends heavily on the inspector's experience, it is time-consuming, and it is restricted to accessible areas [40]. To address these limitations, computer vision-based detection technology [43] and the use of images acquired by drones [44] have been proposed.

Zhang et al. applied the state-of-the-art single-stage detector YOLOv3 to identify various types of defects in concrete bridges and improved its detection accuracy [42]. Other researchers have used different deep learning-based methods to detect bridge damage with good results, such as CNNs [45,46,47,48] and, when the dataset is insufficient, transfer learning based on regions with convolutional neural network features (R-CNN) [40].

4.3 Road Inspection

With the rapid development of road traffic, road surface cracks not only reduce transportation efficiency but also pose a potential threat to vehicle safety, and the importance of road maintenance has attracted increasing attention. It is crucial to repair potholes as soon as they appear in order to prevent accidents [49]. In reality, however, limited human resources make it difficult to detect and repair potholes in time. Much research has focused on road damage detection, using three main approaches: vibration sensor-based, laser scanning-based and computer vision-based methods [50].

With the advent of CNNs [51], image processing technology has recently made significant progress, and computer vision-based methods are widely used to study road defects. Image processing algorithms [52] for crack feature recognition mainly include threshold segmentation [53], edge detection [54] and region growing methods [55]. CNN algorithms have been applied to concrete pavement crack detection [56, 57]. Chun et al. proposed a fully convolutional network-based road surface damage detection method with semi-supervised learning [49]. A hybrid deep CNN, which uses a ResNet50 network for feature extraction and a YOLOv2 network for identification, has been applied to detect and locate moisture damage in asphalt pavements [9].
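The classical threshold segmentation and region growing steps mentioned above can be sketched in a few lines. This is an illustrative sketch rather than the method of any cited work: it assumes a grayscale image stored as nested lists in which crack pixels are darker than the surrounding pavement, a hand-picked global threshold, and a region-growing rule that adds 4-connected neighbors whose intensity is within a tolerance of the seed pixel. Edge detection is omitted for brevity, and the function names, image values, threshold and tolerance are hypothetical.

```python
from collections import deque


def threshold_segment(gray, t):
    """Global thresholding: mark pixels darker than t as crack (1), else 0."""
    return [[1 if v < t else 0 for v in row] for row in gray]


def region_grow(gray, seed, tol):
    """Grow a region from a seed pixel by breadth-first search, adding
    4-connected neighbors whose intensity is within tol of the seed's."""
    h, w = len(gray), len(gray[0])
    ref = gray[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and (nr, nc) not in region
                    and abs(gray[nr][nc] - ref) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Hypothetical 4 x 4 pavement patch: a dark crack plus an isolated dark pixel.
gray = [
    [200, 200, 40, 200],
    [200, 50, 45, 200],
    [200, 55, 200, 200],
    [30, 200, 200, 200],
]
print(threshold_segment(gray, t=100))
# [[0, 0, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
print(sorted(region_grow(gray, seed=(0, 2), tol=20)))
# [(0, 2), (1, 1), (1, 2), (2, 1)]
```

Thresholding also flags the isolated dark pixel in the bottom-left corner, whereas region growing from a seed on the crack keeps only the connected crack pixels. Both rules depend on hand-tuned parameters (t and tol), which illustrates why such methods can struggle under complicated real-world conditions.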

4.4 Other Civil Infrastructures Inspection

In addition to the civil infrastructure mentioned above, deep learning methods have been applied to detect damage in other infrastructure. Structural health monitoring (SHM), which is used to manage and maintain civil infrastructure, generates a large amount of data. Traditional detection technology cannot analyze these data effectively and is time-consuming, laborious and inefficient. Effectively monitoring, mining and using these data therefore requires in-depth research, which motivates the introduction of deep learning-based detection methods. Deep learning-based methods have also been used to detect cracks on concrete surfaces [58,59,60,61,62,63], structures [64, 65] and buildings [66, 67]. In addition, deep learning has been used to identify unsafe behavior on construction sites from two-dimensional images [68, 69]. The experimental results of these studies show significant improvements in accuracy and efficiency. In summary, deep learning has good application prospects in the field of construction.

Through this review, we found that although many researchers apply cutting-edge technologies such as computer vision to civil infrastructure, these technologies have not yet been implemented in practice, and real-time detection has not yet been achieved.

5 Conclusions

Computer vision has attracted increasing attention from researchers and practitioners. This paper gives a brief review of the application of deep learning-based computer vision in civil infrastructure maintenance. First, the research progress of deep learning was reviewed, including important milestones and deep learning architectures. Deep learning is widely used in three major directions of computer vision: image classification, object detection and object segmentation. Second, the models used for these three tasks were summarized. Finally, applications of deep learning-based computer vision for damage detection in the maintenance phase of civil infrastructure, including sewage pipes, bridges and roads, were reviewed.

This review shows that automation and intelligence are attracting growing attention. Applying cutting-edge technology to the construction industry keeps pace with the times, and moving to the forefront of technology is a necessary condition for the development of automation and intelligence in the construction industry. Deep learning-based computer vision also has promising application prospects in other areas of the construction industry. In recent years, the reviewed models have become research hotspots for the effective application of deep learning and CNNs in computer vision, multi-object classification and related fields, and both industry and academia regard them as effective methods and tools. Applying deep learning-based computer vision to the construction management field can drive further innovation and promote the transformation and development of the construction industry. However, although many researchers apply cutting-edge technologies such as computer vision to civil infrastructure, their practical implementation remains limited, and real-time detection is still limited as well.