1 INTRODUCTION

Unsafe actions and behaviors of employees in the construction industry account for about 80–90% of incidents [1], [2], and there is a meaningful correlation between behavior measurement and incident rates [3]; worker behavior management therefore plays a critical role in the success of construction projects. Consequently, the most effective way to enhance safety is behavior measurement (e.g. finding the frequency of unsafe actions and postures) and modification (e.g. offering feedback, setting goals, and engaging workers) [4–8]. The industry, however, lacks cost-effective and robust methods to measure workers’ behavior [9]. Behavior observation requires (1) a time-consuming and labor-intensive effort to collect and analyze records [10], (2) plenty of data to deal with inconsistent and biased results [11], and (3) active worker participation in observing and reporting their own and their colleagues’ unsafe behavior [12]. These requirements impose practical constraints on behavior measurement and thus hinder behavior-based safety management [13, 14].

Computer vision techniques have previously been used in construction-related research, focusing on the detection of construction workers and site machinery and on progress tracking [15–19]. In [20], Teizer described the status quo and challenges of computer vision in construction. Among traditional techniques, the histogram of oriented gradients (HOG) is one of the most widely used. Park et al. [15] used HOG and a histogram of HSV (hue, saturation, value) colors as inputs for a k-nearest neighbors (KNN) classifier. HOG, the histogram of optical flow (HOF), and motion boundary histograms (MBH) were used for the recognition of construction worker actions in [21]. Besides HOG, the Haar cascade [22] is another popular technique used in construction. Du et al. [23] used a Haar-cascade-based approach to detect workers’ hard hats at construction sites. Kim et al. [24] used a combination of the KNN and scale-invariant feature transform (SIFT) algorithms to parse complete images from a construction site. Traditional computer vision techniques are quite straightforward: extracted features are used to encode one image and then to perform classification and clustering for labeling other images. However, the features are extracted by predefined, special-purpose optimized models, which must be developed manually when high-dimensional features are required. Therefore, when multiple features must be considered simultaneously, these traditional methods (HOG, HOF, MBH, etc.) lose their advantages or may even fail to perform the designed tasks.

Convolutional neural networks (CNNs) have stood out as an effective method for solving image-based object detection and classification in construction-related problems [25], and numerous studies have applied CNN-based algorithms to detect unsafe behavior on construction sites. In [26], W. Fang et al. developed a Faster R-CNN and a deep CNN model to identify workers and their harnesses. Precision and recall rates were 99% and 95%, respectively, for the Faster R-CNN detecting workers, and 80% and 98%, respectively, for the CNN detecting people not wearing their safety harnesses. In [27], Q. Fang et al. introduced a deep-learning-based occlusion mitigation method for personal protective equipment (PPE) checking; their experimental results demonstrated that the method is robust under various conditions. In [28], Q. Fang et al. developed a deep-learning algorithm for non-hardhat-use (NHU) detection using more than 100,000 construction worker images. The results showed high precision and recall rates, indicating that the proposed method can facilitate safety inspection and supervision.

When applying CNNs to image classification problems, a pre-trained network, i.e. a network previously trained on a large-scale image dataset, is often used, as this has proved highly effective when only a small dataset is available. CNN architectures such as the VGG family [29], Inception_V3 [30], and ResNet50 [31] are typically trained on the ImageNet dataset [32] and have obtained very good results for general image classification. Inspired by such accomplishments, K. Zdenek et al. [33] retrained the VGG-16 deep learning network on 4000 augmented images. The authors used the core part of the deep CNN VGG-16 to transfer the image feature knowledge stored in the model to a guardrail detection model; an MLP was then trained to process the output of the core VGG-16-based object detector. Its performance was concluded to be better than that of a support vector machine (SVM).

In summary, previous studies were primarily limited to the detection of danger signs related to safety equipment such as helmets, protective clothing, and guardrails. In addition, their datasets came from ideal laboratory conditions rather than actual working environments.

The main contribution of this paper is a method for classifying three unsafe behaviors based on machine vision and deep learning technologies. The aim is to recognize a worker’s dangerous behavior and to accelerate risk analysis and assessment with high accuracy. A CNN architecture is used in combination with the transfer learning approach and optimization of the model’s hyper-parameters. In addition, a dataset with three labeled classes is made available to the research community on GitHub [34]. The study considers only the classification of types of unsafe behavior; it is preceded by the problem of detecting any unsafe behavior at all, which an automated system coupled with video surveillance equipment must also solve.

The remainder of the paper is organized as follows. Section 2 describes the proposed method. Then, the results are analyzed in Section 3. Finally, brief conclusions are made in Section 4.

2 METHODOLOGY

2.1 Overview of the Proposed Method

The proposed method for classifying unsafe behavior includes five stages: (1) data collection, (2) data preprocessing, (3) learning rate optimization, (4) implementing the CNN with the transfer learning approach, and (5) experimentation and evaluation (Fig. 1). First, images are captured from an actual construction site in a variety of settings. Next, usable images are selected from the collected set during pre-processing. Then, the optimal learning rate is determined through an automatic finder algorithm. After that, the transfer learning approach is used to enable the training of deep neural networks without the millions of labeled data points typically needed for such complex models. Finally, the results are evaluated, analyzed, and compared to a number of typical CNN architectures.

Fig. 1. Overview of the proposed method.

2.2 Data Collection

When using machine learning for digital image classification, the number and variety of images in the dataset greatly affect classification accuracy. Currently, common datasets for classifying unsafe behaviors are either limited in quantity or not collected from real working environments, reducing the practicality of research based on them. In this study, images are gathered from a real environment to obtain a more diverse selection: a total of 5000 images, captured by a surveillance camera at a real construction site in Japan, divided into the three dangerous behaviors shown in Fig. 2.

Fig. 2. Dataset sample. (a) Unsafe action 1: Human reach out, (b) Unsafe action 2: Human legs out, (c) Unsafe action 3: Human climb wrong.

2.3 Data Pre-processing

The collected images, at 1200 × 1080 pixels, are then pre-processed through the following steps:

• Step #1 – Detect and remove blurry images: This can be done by either of two methods: the variance of the Laplacian operator [35] or the Fast Fourier Transform (FFT) [36]. Although the FFT method requires some manual tuning, it has proved more robust and reliable in blur detection than the variance-of-Laplacian method (a minimal sketch of the simpler Laplacian check appears after this list).

• Step #2 – Detect and remove irrelevant images: Images that do not contain unsafe actions are manually removed.

• Step #3 – Detect and remove duplicate images: Duplicate images introduce bias into the dataset, making the deep neural network learn patterns specific to those duplicates, and they hurt the model’s ability to generalize to new images outside of what it was trained on. To detect and remove duplicate images, a method called “image hashing” is used [37] (Fig. 3); a sketch follows the figure.
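As an illustration of Step #1, the following is a minimal sketch of the variance-of-Laplacian blur check, assuming OpenCV; the threshold value is a hypothetical, dataset-dependent choice, not one reported by the authors.

```python
import cv2

def is_blurry(image_path: str, threshold: float = 100.0) -> bool:
    """Flag an image as blurry when the variance of its Laplacian
    falls below a dataset-dependent threshold (100.0 is a placeholder)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # The Laplacian responds strongly to edges; a low variance means
    # few sharp edges, which usually indicates blur.
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold
```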

Fig. 3. Image-hashing function.
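For Step #3, the following is a minimal sketch of one common image-hashing scheme, difference hashing (dHash); the exact hashing function used in [37] may differ.

```python
import cv2

def dhash(image_path: str, hash_size: int = 8) -> int:
    """Difference hash: shrink the image, compare horizontally adjacent
    pixels, and pack the resulting bits into a single integer."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Resize to (hash_size + 1) x hash_size so each row yields hash_size comparisons.
    resized = cv2.resize(gray, (hash_size + 1, hash_size))
    diff = resized[:, 1:] > resized[:, :-1]
    return sum(2 ** i for i, bit in enumerate(diff.flatten()) if bit)

# Images that map to the same hash are (near-)duplicates; grouping file
# paths by hash value and keeping one image per group removes the rest.
```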

The processed images are then separated into three parts, training, testing, and validation, with a ratio of 8:1:1, respectively. Before training, the images are also augmented [38] and resized to fit the model’s input requirements, as sketched below.
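A minimal Keras-style sketch of the augmentation and resizing step, assuming the images have already been split into hypothetical data/train and data/val directories with one subfolder per class; the augmentation parameters and 224 × 224 target size are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment only the training images; validation images are merely rescaled.
train_gen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=15,
                               width_shift_range=0.1, height_shift_range=0.1,
                               horizontal_flip=True)
val_gen = ImageDataGenerator(rescale=1.0 / 255)

# target_size resizes every image to the pre-trained model's expected input.
train_data = train_gen.flow_from_directory("data/train", target_size=(224, 224),
                                           batch_size=32, class_mode="categorical")
val_data = val_gen.flow_from_directory("data/val", target_size=(224, 224),
                                       batch_size=32, class_mode="categorical")
```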

2.4 Learning Rate Optimization

The automatic learning rate finder algorithm works through the following steps (Fig. 4); a minimal code sketch follows the list:

Fig. 4. Procedure for learning rate optimization.

• Step #1: Start by defining lower and upper bounds on the learning rate. The lower bound should be very small (1 × 10⁻¹⁰) and the upper bound should be very large (1 × 10¹). At the lower bound the learning rate is too small for the network to learn, while at the upper bound it is so large that the loss diverges.

• Step #2: Start training the network from the lower bound. After each batch update, the learning rate is exponentially increased.

• Step #3: Continue training until the learning rate hits the maximum value. Typically, this entire training process/learning rate increase takes 1–5 epochs.

• Step #4: After training is complete, a graph of loss versus learning rate is plotted to identify the points where the learning rate is:

▪ Large enough for the loss to decrease

▪ Too large, to the point where loss starts to increase.
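The sweep described above can be expressed as a Keras callback; the following is a minimal sketch under the bounds from Step #1, where the number of batches per sweep is an illustrative assumption.

```python
import tensorflow as tf

class LRFinder(tf.keras.callbacks.Callback):
    """Exponentially increase the learning rate after every batch update
    and record the corresponding training loss."""
    def __init__(self, start_lr=1e-10, end_lr=1e1, num_batches=1000):
        super().__init__()
        # Multiplier that carries start_lr to end_lr in num_batches steps.
        self.factor = (end_lr / start_lr) ** (1.0 / num_batches)
        self.start_lr = start_lr
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)

# After 1-5 epochs of training with this callback, plotting losses against
# lrs on a log axis yields a curve like the one discussed below (Fig. 5).
```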

The following figure shows the output of the learning rate finder algorithm on the dataset, using a variation of the VGG19 network (Fig. 5).

Fig. 5. Learning rate finder algorithm output.

Notice that from 1 × 10⁻¹⁰ to 1 × 10⁻⁸ the loss only slightly decreases, meaning the learning rate is too small for the network to actually learn. Starting at approximately 1 × 10⁻⁷ the loss starts to decline; this is the smallest learning rate at which the network can actually learn.

By the time 1 × 10⁻⁶ is reached, the network is learning very quickly. A little past 1 × 10⁻⁴ there is a small increase in loss, but the big increase does not begin until 1 × 10⁻¹.

Finally, by 1 × 10¹ the loss has exploded: the learning rate is far too high for the model to learn.

Given this plot, through visual examination, the lower and upper bounds on the learning rate for the Cyclical Learning Rate (CLR) schedule [39] can be determined to be 1 × 10⁻⁷ and 1 × 10⁻⁴, respectively.
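For illustration, a minimal triangular schedule in the spirit of CLR [39] with the bounds determined above; this sketch is not necessarily the authors’ exact implementation, and the half-cycle length is an assumption.

```python
import numpy as np
import tensorflow as tf

def triangular_clr(step, base_lr=1e-7, max_lr=1e-4, step_size=2000):
    """Triangular CLR: the rate ramps linearly from base_lr up to max_lr
    and back down, with a half-cycle of step_size batch updates."""
    cycle = np.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

class CLRCallback(tf.keras.callbacks.Callback):
    """Apply the triangular schedule once per batch update."""
    def __init__(self):
        super().__init__()
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr,
                                   triangular_clr(self.step))
        self.step += 1
```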

2.5 Implementing Transfer-learning

A CNN architecture typically consists of several convolutional blocks and a fully connected layer. Each convolutional block is composed of a convolutional layer, an activation unit, and a pooling layer. A convolutional layer performs convolution operation over the output of the preceding layers using a set of filters or kernels to extract the features that are important for classification.
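For concreteness, one such convolutional block expressed in Keras; the filter count and input shape here are illustrative only.

```python
from tensorflow.keras import layers, models

# One convolutional block: convolution -> activation unit -> pooling layer.
block = models.Sequential([
    layers.Conv2D(64, (3, 3), padding="same", input_shape=(224, 224, 3)),
    layers.Activation("relu"),    # non-linear activation unit
    layers.MaxPooling2D((2, 2)),  # pooling halves the spatial resolution
])
```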

Although the development of a deep learning model for dangerous behavior recognition and classification is the key part of this research, training a model from scratch takes a considerable amount of time even on a workstation-level computer. For example, training the famous CNN classifier AlexNet [40] took five to six days on two NVIDIA GTX 580 3 GB GPUs due to the large number of images. Such long training times prevent quick validation of a trained classifier under various training options.

Transfer learning is an effective approach for reducing training time by fine-tuning a deep learning model that has previously been trained for a similar purpose. It is currently very popular in the field of deep learning because it enables the training of deep neural networks with comparatively little data. In the present study, three pre-trained CNN models, VGG19, Inception_V3, and InceptionResnet_V2, are experimented with to evaluate their performance.

The VGG network architecture was introduced by Simonyan and Zisserman in their 2014 paper [41]. This network is characterized by its simplicity, using only 3 × 3 convolutional layers stacked on top of each other in increasing depth, with volume size reduction handled by max pooling. Two fully connected layers, each with 4096 nodes, are then followed by a softmax classifier. The number “19” stands for the number of weight layers in the network. VGG19 is a widely used convolutional architecture pre-trained on the ImageNet dataset. The second pre-trained model, Inception-v3, achieves state-of-the-art accuracy for recognizing general objects across 1000 classes, for example, “Zebra”, “Dalmatian”, and “Dishwasher”. This model first extracts general features from input images and then classifies them based on those features [42]. Meanwhile, Inception-ResNet-v2 is a convolutional neural network that achieves a new level of accuracy on the ILSVRC image classification benchmark [43]. It is a variation of the earlier Inception V3 model with ideas borrowed from Microsoft’s ResNet papers [44, 45]. Residual connections introduce shortcuts into the model that allow researchers to successfully train even deeper neural networks, which has led to even better performance. The information for the VGG19, Inception_V3, and InceptionResnet_V2 models is given in Table 1.

Table 1. Pre-trained model properties

For transfer learning, first, only the convolutional part of a model, i.e. everything up to but excluding the top fully connected (FC) layers, is instantiated. It is run once over the training and validation image data, and the output of the last layer before the FC layers, i.e. the output features, is saved. After that, a customized FC head is trained on top of these output features: the output of the last convolutional layer is flattened and connected to the ReLU-activated units of the FC layer. The output layer consists of three units with a softmax activation [46], i.e. a function that turns numbers (logits) into probabilities that sum to one; it outputs a vector representing the probability distribution over the potential outcomes, in this case the three classes. Dropout layers are added after the activation layers to avoid overfitting. The customized fully connected output layer of the network therefore expects three classes instead of those of the pre-trained models.
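A minimal sketch of this scheme for the VGG19 backbone; here x_train is a hypothetical array of preprocessed training images, and the unit count of the custom head is illustrative rather than the authors’ exact configuration.

```python
from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, models

# 1. Instantiate only the convolutional part (include_top=False drops the FC layers).
conv_base = VGG19(weights="imagenet", include_top=False,
                  input_shape=(224, 224, 3))

# 2. Run the frozen base once over the images and save the output features.
features = conv_base.predict(x_train)  # shape: (n, 7, 7, 512) for 224x224 input

# 3. Train a customized FC head on top of the saved features.
head = models.Sequential([
    layers.Flatten(input_shape=features.shape[1:]),
    layers.Dense(256, activation="relu"),   # ReLU-activated FC units
    layers.Dropout(0.5),                    # dropout against overfitting
    layers.Dense(3, activation="softmax"),  # three unsafe-behavior classes
])
```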

Next, the last few layers of the network are compiled and trained to adjust the pre-trained weights, using a loss function such as categorical cross-entropy and an optimizer such as Adam [47]. The training is performed using Keras, an open-source deep learning framework, with a TensorFlow backend [48]. The hardware is a personal computer with the following configuration: an Intel® Core™ i7-7700HQ CPU (4 cores, 8 threads) @ 2.80 GHz, 16 GB of RAM, and a GeForce® GTX 1050 4 GB GPU. Figure 6 shows the customized FC layer architecture, and Fig. 7 shows the VGG19 pre-trained model architecture.
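Compiling and training the head might then look like the following sketch, continuing the example above; y_train is a hypothetical one-hot label array aligned with the saved features, the learning rate is illustrative, and the epoch count is taken from Section 3.

```python
from tensorflow.keras.optimizers import Adam

head.compile(loss="categorical_crossentropy",  # loss for one-hot 3-class labels
             optimizer=Adam(learning_rate=1e-4),
             metrics=["accuracy"])

# The CLR callback sketched in Section 2.4 could be passed via callbacks=[...].
history = head.fit(features, y_train, epochs=150, batch_size=32,
                   validation_split=0.1)
```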

Fig. 6. Customized FC layers architecture.

Fig. 7. VGG19 pre-trained model architecture.

3 EXPERIMENTATION AND EVALUATION

The classifiers are trained with the following hyper-parameters (Table 2).

Table 2. Hyper-parameters

For evaluation, the training dataset is used to fit the model, while the validation dataset provides an unbiased evaluation of the model’s fit on the training data during hyper-parameter tuning. The trained network is then asked to predict the label of each image in the testing set, which provides an unbiased evaluation of the final model. These predictions are compared with the ground-truth labels, i.e. the actual categories of the testing images. From there, the number of correct predictions can be aggregated into a confusion matrix (Fig. 8). Precision, recall, F1 score, and accuracy are used to quantify the performance of the network as a whole; a sketch of how they can be computed follows the formulas below.

Fig. 8. Confusion matrix.

Recall is the fraction of actual positives that the model manages to identify (true positives):

$$\mathbf{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

Precision is how precise the model is, i.e. out of the samples predicted positive, how many are actually positive:

$$\mathbf{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

The F1 score balances precision and recall as their harmonic mean:

$$\mathbf{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Accuracy is the fraction of all predictions the model gets right:

$$\mathbf{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
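These quantities, together with the confusion matrix, can be computed directly from the predicted and ground-truth labels, e.g. with scikit-learn; in this sketch, model, x_test, and y_test are hypothetical continuations of the earlier examples.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_prob = model.predict(x_test)      # softmax probabilities, shape (n, 3)
y_pred = np.argmax(y_prob, axis=1)  # predicted class = highest probability
y_true = np.argmax(y_test, axis=1)  # ground-truth class from one-hot labels

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```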

Comparing the three pre-trained models during training (Figs. 9, 10, 11, 12), the accuracy of InceptionResnet_V2 is always higher than those of Inception_V3 and VGG19, and the accuracy of Inception_V3 is lower than that of VGG19. This pattern is maintained during validation: the accuracy of VGG19 is much higher than that of Inception_V3 but lower than that of InceptionResnet_V2. In terms of loss, the value for InceptionResnet_V2 is always lower than those of the other two models; likewise, the loss of Inception_V3 is higher than that of VGG19 during training and validation, and the loss of InceptionResnet_V2 is much lower than that of Inception_V3 and slightly lower than that of VGG19.

Fig. 9. Training accuracy curves of the three pre-trained models.

Fig. 10. Validation accuracy curves of the three pre-trained models.

Fig. 11. Training loss curves of the three pre-trained models.

Fig. 12. Validation loss curves of the three pre-trained models.

The training/validation accuracy and loss curves are shown in Figs. 13 to 15. For the VGG19 and InceptionResnet_V2 models, accuracy clearly increases and loss decreases gradually throughout training and validation. Meanwhile, the accuracy and loss values of Inception_V3 do not exhibit such a trend.

Fig. 13. Training/validation accuracy and loss curves of the VGG19 model.

Fig. 14. Training/validation accuracy and loss curves of the Inception_V3 model.

Fig. 15. Training/validation accuracy and loss curves of the InceptionResnet_V2 model.

Figures 16, 17, and 18 are the confusion matrices of the three models. Inception_V3 evidently cannot distinguish among the three unsafe actions. The average percentage of images predicted correctly, i.e. matching their actual labels, is higher for InceptionResnet_V2 than for VGG19 and Inception_V3.

Fig. 16. Confusion matrix of the VGG19 model.

Fig. 17. Confusion matrix of the Inception_V3 model.

Fig. 18. Confusion matrix of the InceptionResnet_V2 model.

Lastly, as shown in Figs. 19, 20, and 21, the average F1 score of InceptionResnet_V2 is greater than those of the other two models. The F1 scores of the InceptionResnet_V2, VGG19, and Inception_V3 models are 0.91, 0.90, and 0.41, respectively.

Fig. 19. Classification report of the VGG19 model.

Fig. 20. Classification report of the Inception_V3 model.

Fig. 21. Classification report of the InceptionResnet_V2 model.

In conclusion, InceptionResnet_V2 shows the best results for the problem addressed in this paper. The model’s high accuracy of 92.44% also confirms that the transfer learning approach is effective while cutting down on training time.

4 CONCLUSIONS AND PERSPECTIVE

In this study, a method for classifying workers’ dangerous behaviors based on machine vision and deep learning technologies is proposed. A new dataset with three labeled classes is created and published on GitHub for the research community. This dataset was collected from a real construction environment using surveillance cameras and was then reviewed by occupational safety experts, who provide the quality assurance for this research.

Based on the image dataset obtained, the transfer learning approach is applied to three pre-trained models: VGG19, Inception_V3, and InceptionResnet_V2. Customized FC layers are defined and attached to these models to perform classification. The results indicate that InceptionResnet_V2 performs better than VGG19 and Inception_V3 for classifying workers’ dangerous behaviors. After 150 epochs, its accuracy reaches 92.44%, compared to 91.16% and 47.06% for VGG19 and Inception_V3, respectively.

The proposed approach is not limited to three output classes. In the future, a larger dataset covering more types of unsafe behavior can be built according to actual needs.

This work can serve as a reference for problems in pattern recognition and object classification, as well as for many areas of deep learning, such as computer vision, parameterized learning, optimization methods, and regularization.