1 Introduction

Engineers take an enormous number of pictures at construction sites and disaster areas, and these pictures are used for inspection, management, disaster recovery, and scientific purposes. The pictures contain objects such as construction machinery, signage, signboards, and construction workers. To use the pictures for inspection and construction management, these objects currently have to be identified manually. To improve efficiency, the detection process should be automated. Photographs can now be classified by object detection functions based on machine learning, e.g., People in iOS [1] and face grouping in Google Photos [2]. However, those systems mostly target ordinary things such as faces, automobiles, bicycles, and televisions, not civil engineering entities such as construction machinery, signboards at construction sites, stakes, and scaffolding.

Machine learning based on image features has often been used for object detection in images. Such features include local features such as Haar-like features [3], Histograms of Oriented Gradients (HOG) features [4], Edge Oriented Gradients features [5], and Edgelet features [6]. However, the user must choose appropriate features, and it is difficult to reach a satisfactory level of object detection. In addition, because a single feature produces many false detections, joint features or boosting [7], which combines multiple object detection mechanisms, become necessary, and this requires considerable work and effort during the machine learning phase.
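To make the idea of a hand-crafted local feature concrete, the following is a minimal sketch of the core step of a HOG feature: a magnitude-weighted histogram of unsigned gradient orientations for a single cell. The function name, the 9-bin choice, and the synthetic `ramp` image are illustrative assumptions, and block normalization is omitted; it is not the implementation used in this research.

```python
import math


def hog_cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `cell` is a list of rows of grayscale values. Border pixels are skipped
    so that central differences are always defined.
    """
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# Synthetic 8x8 cell whose brightness grows from left to right (assumed data).
ramp = [[float(x) for x in range(8)] for _ in range(8)]
hist = hog_cell_histogram(ramp)
```

For this ramp image every interior gradient points horizontally, so all of the weight falls into the first orientation bin. A real detector concatenates such histograms over many cells, which is the kind of design decision the user has to make by hand.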

In deep learning with Convolutional Neural Networks (CNNs), by contrast, features of objects are found and computed automatically. The user therefore does not have to choose the features, and the accuracy of object detection has improved significantly. Research relevant to ordinary object detection includes VGG-16 [8], Regions with CNN features (R-CNN) [9], Fast Region-based Convolutional Network (Fast R-CNN) [10], Faster R-CNN [11], Single Shot Multibox Detector (SSD) [12], and You Only Look Once (YOLO) [13]. However, deep learning requires a huge amount of data. For ordinary object detection, large datasets such as Pascal VOC [14], Caltech101 [15], the COCO dataset [16, 17], and the UCI Machine Learning Repository [18] are available. For example, Pascal VOC has 20 classes in categories such as person, animal, vehicle, and indoor, with 9,963 images containing 24,640 annotated objects. However, datasets for non-ordinary objects such as construction machinery and civil engineering objects are not yet available and must be developed to achieve high accuracy in object detection.

Machine learning methods include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Recently, transfer learning has received attention from Artificial Intelligence (AI) researchers. In transfer learning [19, 20], knowledge gained while solving one problem is stored and applied to a different but related problem. Transfer learning is thought to be particularly useful when a large dataset cannot be collected easily. Fine tuning is a subset of transfer learning, which classifies objects that are already classified with similar but different labels. In this research, transfer learning but not fine tuning is employed.

The objective of this research is to develop a method that automatically detects construction-domain-specific objects, such as construction machines, construction workers, and construction signboards, in digital images using deep learning with transfer learning, and that classifies the pictures according to their contents. Transfer learning was adopted because existing datasets lack civil-engineering-specific objects and because it is hard to collect the huge amount of such data needed for deep learning in a short time with limited funds. To verify the proposed method, performance is compared among conventional machine learning using HOG features and deep learning with and without transfer learning. HOG features with a Support Vector Machine (SVM) were chosen as the conventional detector because HOG is more robust to local shape changes than the other features described above.

2 Proposed Object Detection Method

2.1 Overview of the Proposed Method

The proposed method detects objects of the construction-specific domain and their positions in digital images. SSD is employed to detect the positions as well as the types of the objects in images. Next, correct answer labels for re-training SSD are created. LabelImg [21] is used to create these learning data labels. Finally, SSD is re-trained on the created dataset. Using the new weights acquired by transfer learning, the object positions in the digital image are determined.

In this research, Keras was used as the machine learning library. Keras is written in Python and can use TensorFlow as its backend. TensorFlow, which is developed and provided by Google Inc., was used as the backend for recognizing object positions. Figure 1 shows the overall flow of the object position detection method.

Fig. 1. Overview of the flow of object position detection method.

SSD is an object detection algorithm built by Liu et al. that uses a single, simple network and is faster than conventional object detection algorithms. A simplified network model of SSD is shown in Fig. 2. VGG-16, a learning model for image classification, is used as the base network of SSD. After this base network, hierarchical feature maps allow objects of various scales to be processed, which raises the detection accuracy. A further reason for the high accuracy is that SSD evaluates candidate boxes for each aspect ratio of the object. For these reasons, SSD can detect a target object even in a relatively low-resolution image. This matters because, even when digital images taken at a construction site are high resolution, distant objects occupy only a low-resolution region of the image.
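The role of scales and aspect ratios can be sketched with the default-box rule described by Liu et al.: each feature map is assigned a scale that is linearly spaced between a minimum and a maximum, and each aspect ratio a gives a box of width s·sqrt(a) and height s/sqrt(a). The function names below are illustrative, the values 0.2 and 0.9 are the defaults reported by Liu et al., and the number of feature maps and the aspect ratios are assumptions rather than the settings used in this research.

```python
import math


def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced scales, one per feature map (fraction of image size)."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) pairs for one scale; every box keeps the area scale**2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]


scales = default_box_scales(6)           # assumed: six feature maps
shapes = default_box_shapes(scales[0])   # boxes on the finest feature map
```

Fine feature maps receive small scales and coarse maps receive large ones, which is why small, distant objects and large, nearby objects can both be matched by some default box.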

Fig. 2. Simplified network image model of SSD (drawn by the authors with reference to Fig. 2 of Liu et al.).

2.2 Detection of Object Type and Position

Creation of Correct Answer Labels for Object Position Detection.

To detect an object in a digital image with machine learning, it is necessary to know what exists in the image and where it is. Hence, a bounding box and its coordinates are required to create correct answer labels. In addition, the correct answer labels must be kept together with the original digital images. The labels are created by using LabelImg [21], a tool built by Tzutalin for creating correct labels, called annotation data. The annotation data includes information such as the name of the original image and the coordinates of each bounding box (horizontal and vertical coordinates on the image). First, the image to be learned is opened (Fig. 3(a)). Then, the coordinates of the upper left and lower right corners of the object are specified. Bounding boxes for multiple objects can be created in one image (Fig. 3(b)). The created annotation data is saved as an XML file, as shown in Fig. 3(c).
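Reading such annotation data back can be sketched as follows. The element names (`filename`, `object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) follow the Pascal VOC layout that LabelImg writes by default, which is assumed here to match the file in Fig. 3(c); the file name, class names, and coordinates in `SAMPLE_XML` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout (values are illustrative only).
SAMPLE_XML = """<annotation>
  <filename>site_0001.jpg</filename>
  <object>
    <name>backhoe</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>310</xmax><ymax>275</ymax></bndbox>
  </object>
  <object>
    <name>worker</name>
    <bndbox><xmin>350</xmin><ymin>120</ymin><xmax>400</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""


def read_annotation(xml_text):
    """Return the image name and a list of (class name, corner coordinates)."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        # Upper left corner (xmin, ymin) and lower right corner (xmax, ymax).
        corners = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), corners))
    return root.findtext("filename"), boxes


filename, boxes = read_annotation(SAMPLE_XML)
```

Each image yields one file name and any number of labeled boxes, which corresponds to the several bounding boxes per image mentioned above.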

Fig. 3. Sample of supervising data.

Transfer Learning and Object Type/Position Detection.

SSD is trained on the created correct labels and the original images. When learning is completed, new sets of weights have been acquired, one for each epoch. An appropriate set of weights is selected from them and used for object detection. The selection is based on the transition of the loss values obtained during transfer learning (Fig. 4). In Fig. 4, since the difference between the training data loss and the test data loss is smallest at Epoch 41, the weights at this point are selected.
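The selection rule described above, choosing the epoch at which the gap between training loss and test loss is smallest, can be sketched in a few lines. The function name and the loss values are hypothetical and are not the values plotted in Fig. 4.

```python
def select_epoch(train_loss, test_loss):
    """Return the 1-based epoch with the smallest |training loss - test loss|."""
    gaps = [abs(train - test) for train, test in zip(train_loss, test_loss)]
    return gaps.index(min(gaps)) + 1


# Hypothetical per-epoch losses, for illustration only.
train_loss = [2.10, 1.40, 1.05, 0.90, 0.70]
test_loss = [2.60, 1.70, 1.15, 0.92, 0.95]
best_epoch = select_epoch(train_loss, test_loss)
```

With these made-up values the fourth epoch is chosen: the training loss keeps falling at the fifth epoch, but the test loss rises again, so the gap widens.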

Fig. 4. A sample of transition of training data loss and test data loss.

Classification of Digital Image Files.

Folders named after the detected object classes, such as bulldozers, backhoes, dump trucks, wheel loaders, workers, and signboards, are prepared on a computer. Every digital image file in which at least one object is detected is put into the corresponding folder. If objects of two or more classes are detected, the file is copied into each of the corresponding folders.
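This filing step can be sketched with the Python standard library. The function name, the folder layout, and the placeholder file are assumptions for illustration; the detector output is represented simply as a list of class names.

```python
import shutil
import tempfile
from pathlib import Path


def classify_image(image_path, detected_classes, output_root):
    """Copy one image into a folder per detected class and return the copies."""
    copies = []
    for class_name in sorted(set(detected_classes)):
        folder = Path(output_root) / class_name
        folder.mkdir(parents=True, exist_ok=True)
        # shutil.copy2 returns the path of the newly created file.
        copies.append(Path(shutil.copy2(image_path, folder)))
    return copies


# Demonstration in a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    photo = Path(tmp) / "photo_0001.jpg"
    photo.write_bytes(b"placeholder image bytes")
    sorted_root = Path(tmp) / "sorted"
    copies = classify_image(photo, ["worker", "backhoe", "worker"], sorted_root)
    copied_folders = [copy.parent.name for copy in copies]
    copies_exist = all(copy.is_file() for copy in copies)
    no_detection_copies = classify_image(photo, [], sorted_root)
```

An image showing a backhoe and two workers ends up once in each of the two folders, and an image with no detections is not copied anywhere.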

3 Experiments

Two experiments were executed. Experiment I compares the detection results of the proposed method, i.e., deep learning with transfer learning, with those of deep learning without transfer learning. Experiment II compares the results among (a) deep learning with transfer learning, (b) deep learning without transfer learning, and (c) conventional machine learning. Construction machinery (backhoes, bulldozers, dump trucks, and wheel loaders), workers, and signboards were selected as the objects to be detected, and the accuracy of object position detection was tested. Detection accuracy is evaluated based on the criteria shown in Table 1. The system environment for learning and testing is shown in Table 2. The numbers of images for Experiments I and II and for testing are shown in Table 3.

Table 1. Evaluation criteria for detection accuracy.
Table 2. System environment for learning and testing.
Table 3. The number of images for Experiment I, II, and testing.

3.1 Experiment I: Verification for Deep Learning with Transfer Learning

Experiment I compared the detection accuracy of deep learning with and without transfer learning. In this research, the model trained with transfer learning is named I-A and the model trained without it is named I-B. Of the learning dataset, 90% was used for training and 10% for testing.
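A 90%/10% split of this kind can be sketched as below. How the split was actually drawn is not stated in this paper, so the random shuffle with a fixed seed, the function name, and the file names are assumptions.

```python
import random


def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle reproducibly, then split into (training, testing) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the learning dataset.
images = [f"img_{i:04d}.jpg" for i in range(200)]
train_images, test_images = split_dataset(images)
```

Fixing the seed keeps the split identical across runs, so that two models trained on it can be compared on the same held-out images.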

Figure 5 shows samples of positive and negative detection cases for each object. For example, in the negative detection case for the backhoe, the machine was detected as a dump truck. Figure 6 shows the result of the object detection experiment using the testing data. Detection accuracy for I-A (deep learning with transfer learning) was 86.6% for backhoes, 61.7% for bulldozers, 80.9% for dump trucks, 86.2% for wheel loaders, 100.0% for signboards, and 79.7% for workers. In contrast, accuracy for I-B (deep learning without transfer learning) was 0.0%, 20.6%, 0.0%, 17.1%, 60.9%, and 0.0%, respectively. I-A clearly shows much higher accuracy than I-B.
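Per-class figures of this kind can be tallied as in the sketch below, assuming each test object has already been judged against the criteria of Table 1 and reduced to a pair of true class and detected class (or `None` when nothing was detected). The function name and the pairs are hypothetical and do not reproduce the experimental data.

```python
from collections import Counter


def per_class_accuracy(pairs):
    """Percentage of objects of each true class that were detected as that class."""
    total, correct = Counter(), Counter()
    for truth, detected in pairs:
        total[truth] += 1
        if detected == truth:
            correct[truth] += 1
    return {name: 100.0 * correct[name] / total[name] for name in total}


# Hypothetical outcomes: one backhoe mistaken for a dump truck, one worker missed.
pairs = [
    ("backhoe", "backhoe"),
    ("backhoe", "dump truck"),
    ("worker", "worker"),
    ("worker", None),
]
accuracy = per_class_accuracy(pairs)
```

Both a wrong class and a missed object count against the true class, matching the negative detection cases illustrated for the backhoe.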

Fig. 5. Samples for positive and negative detection cases for each object class.

Fig. 6. Result of object detection experiment (Experiment I) using the testing data.

3.2 Experiment II: Comparison with Conventional Machine Learning

Experiment II compared the detection accuracy of deep learning with transfer learning and conventional machine learning (CML). In Experiment II, the model using deep learning with transfer learning is named II-A and the CML model is named II-CML.

Implementation of CML and Numbers of Images.

In this research, object detection with HOG features and SVM was executed using Dlib [22], a machine learning library. Dlib was chosen because it can be trained with correct labels created in the same way as for SSD. In object detection using Dlib, however, bounding boxes whose aspect ratios differ greatly from one another, as shown in Fig. 7, cannot be used for learning.
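One way to screen labels for this restriction is sketched below: boxes whose width-to-height ratio lies far from the median ratio are set aside before training. The function name, the use of the median, and the 50% tolerance are assumptions for illustration; the sketch does not call Dlib and does not reproduce the check that Dlib itself applies.

```python
import statistics


def filter_by_aspect_ratio(boxes, tolerance=0.5):
    """Keep boxes whose width/height ratio is within `tolerance` of the median.

    Each box is (xmin, ymin, xmax, ymax); `tolerance` is a relative deviation.
    """
    ratios = [(x2 - x1) / (y2 - y1) for x1, y1, x2, y2 in boxes]
    median = statistics.median(ratios)
    return [
        box for box, ratio in zip(boxes, ratios)
        if abs(ratio - median) / median <= tolerance
    ]


# Hypothetical labels: two wide boxes and one tall, narrow box.
boxes = [(0, 0, 200, 100), (10, 10, 190, 100), (0, 0, 50, 300)]
kept = filter_by_aspect_ratio(boxes)
```

Here the tall, narrow box is dropped because a single fixed-shape HOG window cannot cover it together with the wide boxes.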

Fig. 7. Sample case of very different aspect ratios among the bounding boxes created as correct labels.

Comparison of Detection Results between Deep Learning and CML.

Samples of the detection results are shown in Fig. 8, which presents cases of positive detection by II-A and negative detection by II-CML. In II-CML, Dlib with HOG and SVM was used for detection, whereas in II-A, SSD was used. Figure 9 compares the accuracy of II-A and II-CML. The accuracy of the method using deep learning with transfer learning (II-A) is about 90% or higher for all objects, while that of II-CML ranges between 30% and 75%. Thus, Experiment II shows that the proposed method using deep learning with transfer learning has much higher accuracy than CML using HOG and SVM.

Fig. 8. Sample cases of positive detection by deep learning with transfer learning (II-A) and negative detection by conventional machine learning (II-CML).

Fig. 9. Testing result of object position detection comparing deep learning with transfer learning (II-A) and conventional machine learning (II-CML).

4 Conclusions

In this research, a problem was identified in tasks such as collecting, sorting, annotating, storing, deleting, and distributing the enormous number of digital photographs taken by engineers at construction sites and disaster areas: classifying a huge number of photographs manually is a challenge. To automate these error-prone and cumbersome tasks, an object detection method was proposed that detects not just the object class but also its position as a bounding box. In the proposed method, deep learning (CNNs) with transfer learning was employed because construction-specific objects such as construction machinery, workers, and signboards are not covered by the large datasets provided for AI researchers, most of which contain ordinary things. The first reason for adopting deep learning with transfer learning is that deep learning can automatically determine features from digital images, while conventional machine learning requires manual feature selection. The second reason is that, although deep learning has this advantage, it requires a huge amount of training data, which is difficult to obtain for construction-specific objects that are not available in popular, open datasets. Thus, transfer learning, in which knowledge obtained for one problem is applied to a different but related problem, was adopted. After object position detection is done, digital photo files can be classified, copied, and put into the designated folders automatically.

Based on the proposed methodology, a system was developed employing SSD and VGG-16. A number of digital images of backhoes, bulldozers, dump trucks, wheel loaders, construction workers, and construction signboards were collected and used for learning. To verify the proposed method, Experiment I, which compares the object detection accuracy of deep learning with and without transfer learning, was executed. As a result, the proposed method showed about 80% accuracy, while deep learning without transfer learning showed very poor performance. Next, Experiment II, which compares deep learning with transfer learning and conventional machine learning (CML) using HOG features with SVM, was executed. The result was that the proposed method showed about 90% accuracy, while that of CML ranged from 30% to 75%.

In conclusion, the proposed method of deep learning with transfer learning can be a useful and effective way to detect object positions in digital images, especially in areas where the domain is specific, such as construction sites, and huge datasets are not available.

Future work includes developing an object shape detection methodology, applying it to construction-specific objects, and improving the object detection accuracy by enhancing the quality and increasing the amount of data. Because this research has a wide range of potential applications in construction and disaster management, the authors plan to apply future research outcomes to various applications.