Introduction

Highway construction projects rely heavily on the efficient management and deployment of a wide range of heavy machinery. Equipment such as excavators, which are used for ground preparation, and dump trucks, which handle material transportation, is critical to the successful completion of various construction phases (Kim, Kim et al., 2018b; Nath & Behzadan, 2020). Traditionally, equipment classification and identification were based on manual inspection by trained personnel (Akhavian & Behzadan, 2015; Cheng et al., 2010). Although this method can achieve a certain level of accuracy, it has several limitations. Manual classification is inherently time-consuming and resource-intensive. Furthermore, human error can cause inconsistencies and inaccuracies, particularly in large-scale projects involving a variety of equipment types (Sherafat et al., 2020).

Deep Learning (DL) methods, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have emerged as powerful tools in a variety of fields, including construction management (Elghaish et al., 2022; Ji et al., 2023; Park et al., 2023; Soltani et al., 2016; Yabuki et al., 2018). Unlike conventional artificial neural networks (ANNs), CNN models can autonomously identify complex patterns and representations from raw data, which offers promising opportunities for optimizing construction workflows and improving equipment management. Nevertheless, there is a research gap in the literature regarding the use of DL techniques, especially CNNs, to classify heavy machinery employed in highway construction projects (Akinosho et al., 2020).

Prior research has predominantly concentrated on applying DL to predict equipment failures, optimize maintenance schedules, and monitor construction progress. These studies highlight the potential of DL to transform multiple facets of construction management (Bunrit et al., 2019; Jung et al., 2022). However, there has been limited exploration into the specific challenges of classifying heavy equipment, particularly in highway construction projects (Akinosho et al., 2020). Additionally, existing research focuses on classifying a limited number of equipment categories, lacking comprehensive coverage of the diverse array of heavy machinery used in these projects. To bridge these research gaps, this paper aims at achieving the following research objectives:

  1. Conduct a review of existing literature to identify previous studies on the classification of heavy construction equipment in highway construction projects to establish the novelty and significance of the study.

  2. Evaluate previous studies on classifying construction equipment to understand existing approaches and limitations.

  3. Develop a CNN model to classify a wide range of heavy construction equipment in highway projects, addressing limitations in previous studies focusing on fewer equipment classes.

  4. Rigorously test the CNN model to demonstrate its accurate classification of heavy construction equipment, validating its potential for real-world applications.

Literature review

Precise classification and detection of construction equipment is crucial for enhancing project efficiency, ensuring safety, and optimizing resource allocation (Kim et al., 2018a). It also helps minimize maintenance expenses and project delays (Post et al., 2018; Slaton et al., 2020a). By effectively monitoring equipment on construction sites, construction managers can improve productivity, reduce downtime, and mitigate risks (Mohy et al., 2024; Xu et al., 2023; Yan et al., 2017). Ultimately, real-time equipment monitoring contributes to keeping projects on track and within budget.

Traditional classification techniques like ANNs, Support Vector Machines (SVMs), and k-Nearest Neighbors (kNN) have been used in various classification tasks (Anirudh et al., 2023; Elshaboury et al., 2024; Kaveh, 2024a, b; Kaveh & Khavaninzadeh, 2023; Obianyo et al., 2023; Yamany, 2020; Zihan et al., 2023). These algorithms depend largely on manually extracted features that are created and fed into the algorithm. However, these classification models are limited by their learning capabilities and heavy reliance on expert domain knowledge to define features (Akinosho et al., 2020; Fang et al., 2016; Li et al., 2023). In contrast, the advent of DL, particularly CNNs, has transformed the field. CNNs have the capability to automatically learn relevant features directly from raw image data, eliminating the need for manual feature extraction (Xiao & Kang, 2021; Zhao et al., 2020). Groundbreaking research has been conducted on the use of CNNs in equipment classification, demonstrating that even shallow CNN architectures can be effective in tasks such as monitoring excavators. For example, one study found that CNNs could classify seven different excavator activities with 90.7% accuracy using data from inertial measurement unit signals (Slaton et al., 2020a). This success is attributed to CNN’s capability to efficiently extract spatial features from sensor data using parallel convolution operations.

Over the last decade, the use of DL for detecting construction equipment has expanded substantially. Table 1 provides a comprehensive comparison of various DL-based recognition techniques, serving as a valuable source for understanding the current landscape of DL applications in construction. This table outlines research efforts across different sub-fields of the broader construction domain, highlighting the versatility and growing importance of DL in addressing the challenges associated with the detection of construction equipment.

Table 1 Summary of studies that used DL for object classification and detection on construction sites

There has been little emphasis in the literature on the classification and detection of heavy equipment used in highway construction projects. For example, Arabi et al. (2020) developed a practical DL approach for detecting six types of construction equipment used in highway construction. This approach achieved a mean average precision of over 90%, making it suitable for real-time construction applications such as safety monitoring and productivity assessment. In addition to classification and detection of construction equipment, other studies have investigated various aspects of equipment usage, including productivity for modular construction safety, which used R-CNN and achieved a precision of 0.890 (Zheng et al., 2020). Wang et al. (2022) developed a DeepLabV3+ model for monitoring construction sites with an accuracy of 0.926, whereas Braun et al. (2020) created a CNN model for monitoring construction tasks with a recall of 0.914 and an F1 score of 0.927. Moreover, Xiao and Kang (2020) focused on productivity-related tasks, illustrating the potential of DL techniques to optimize equipment utilization and operational efficiency. Furthermore, Shen et al. (2024) applied a Temporal Convolutional Network (TCN) model for monitoring equipment activities, achieving precision and recall scores of 0.945 and 0.944, respectively.

Most DL models developed for classifying and detecting construction equipment address a limited number of classes. Studies such as Ding et al. (2018) and Hernandez et al. (2019a) focused on fewer than ten equipment classes. Ding et al. (2018) achieved a high accuracy of 0.970 in detecting unsafe behaviour using a CNN model, while Hernandez et al. (2019a) obtained an accuracy of 0.771 for general monitoring of equipment activity tasks using an LSTM model. In contrast, few studies have developed models for more than ten classes. For instance, Shen et al. (2024) and Nath et al. (2020) explored classification tasks involving a higher number of equipment categories, highlighting the need for further research in this area to develop more robust models capable of handling a broader range of equipment types. According to the literature review conducted in this study, most prior studies have concentrated on detecting and classifying construction equipment into ten or fewer classes. This underscores the necessity for advancements in DL models to handle more comprehensive classifications, particularly in complex and dynamic construction environments.

Overview of CNN model

Object classification and detection technology has evolved significantly, transitioning from methods that relied on hand-crafted features like Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) to deep learning approaches, particularly CNNs (Nath et al., 2020). In contrast to traditional ANNs, which use image pixels directly for classification, CNN models simplify this process by consolidating weights into smaller kernel filters, which enhances learning efficiency and robustness. CNNs are a powerful type of deep neural network capable of directly learning complex patterns from data, leading to substantial advancements in object detection, image classification, speech recognition, and feature extraction (Fang et al., 2018; Huang et al., 2018; Zhang, 2022). CNNs are structured with foundational components that enable sophisticated image processing, which are as follows:

  • Convolutional Layers: These layers apply convolution operations to input images, producing feature maps emphasizing specific visual patterns. During training, the network identifies and prioritizes important features necessary for accurate image scanning and categorization, as depicted in Fig. 1. The convolution operation can be mathematically represented as

$${\rm{Output}}[i,j]\, = \,\sum\limits_m {\sum\limits_n {\left( {{\rm{Filter}}[m,n] \times {\rm{Input}}[i + m,j + n]} \right)} } \, + \,{\rm{Bias}}$$
(1)

where Output[i, j] represents the value located at position (i, j) within the feature map; Filter[m, n] denotes the value positioned at (m, n) within the filter; Input[i + m, j + n] corresponds to the value found at position (i + m, j + n) within the input image; and Bias is a trainable value that adjusts the output of the filter.

Fig. 1 Convolution operation in CNN model (Albelwi & Mahmood, 2017)

  • Pooling Layers: Situated between convolutional layers, pooling layers downsample feature maps while retaining essential features extracted by preceding layers. A common pooling operation is max pooling, which selects the maximum value within each pooling window.

  • Activation Functions: Non-linear activation functions like ReLU introduce non-linearity into the network, enhancing its ability to learn complex relationships. The ReLU function is expressed as:

$${\rm{ReLU(x)}}\,{\rm{ = }}\,{\rm{max(0,x)}}$$
(2)
  • Fully Connected Layers: At the final stages, fully connected layers process flattened outputs from convolutional layers to compute class probabilities using the SoftMax activation function for classification.
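
The four components above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the 5 × 5 input, the vertical-edge filter, the bias of 0.5, and the example logits are all made-up values chosen so that the arithmetic can be checked by hand. The function `conv2d` follows Eq. (1) for a single channel with stride 1 and no padding, and `relu` follows Eq. (2).

```python
import math


def conv2d(image, kernel, bias=0.0):
    """Single-channel convolution as in Eq. (1): stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for m in range(kh):
                for n in range(kw):
                    total += kernel[m][n] * image[i + m][j + n]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Element-wise ReLU as in Eq. (2)."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature_map[i * size + a][j * size + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Illustrative values only (not taken from the dataset).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d(image, kernel, bias=0.5)   # 3 x 3 feature map
features = max_pool(relu(feature_map))          # 1 x 1 after 2 x 2 pooling
probs = softmax([2.0, 1.0, 0.1])                # hypothetical three-class logits
```

In a full CNN the same operations are applied across many filters and channels, and the filter and bias values are learned during training rather than fixed by hand.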

Research methodology

Figure 2 illustrates the systematic methodology employed for heavy construction equipment image classification using a CNN model.

Fig. 2 Research methodology

Construction equipment image data collection

A comprehensive dataset of heavy construction equipment images was meticulously constructed for training and evaluating the CNN model. This dataset encompasses 10,846 images categorized into 12 distinct classes, ensuring a diverse representation of various equipment types (e.g., excavators and loaders). The dataset was divided into three subsets:

  1. Training Dataset (approximately 60%, 6,595 images): This subset was used to train the CNN model, allowing it to learn the complex relationships between image features and their respective equipment classes.

  2. Validation Dataset (approximately 30%, 3,291 images): This subset helped monitor the model’s performance throughout training to mitigate the risk of overfitting. High performance on the validation set indicates the model’s ability to generalize to new, unseen data.

  3. Testing Dataset (approximately 10%, 960 images): This subset was used for the final evaluation after the training phase, providing an unbiased measure of the model’s accuracy in real-world classification tasks.

To maintain consistency and facilitate interpretation, each image was assigned a unique numeric identifier ranging from 0 to 11, corresponding to its specific equipment class. This labeling system simplifies the referencing and analysis of classification results. By using sequentially organized labels, the model’s predictions can be easily matched with the respective equipment types, improving the clarity and understanding of the results for subsequent data analysis, comparisons, and decision-making processes. Table 2 provides the labels and descriptions for the 12 equipment classes.
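
The split sizes reported above can be checked, and a comparable split reproduced, with a short sketch. The image counts are taken from the text; the shuffling procedure, the seed, and the helper names `split_fractions` and `split_indices` are assumptions for illustration, since the paper does not state how images were assigned to subsets.

```python
import random

TOTAL_IMAGES = 10_846
SPLIT_COUNTS = {"train": 6_595, "validation": 3_291, "test": 960}


def split_fractions(counts):
    """Share of the whole dataset held by each subset."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def split_indices(n_items, counts, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into consecutive, disjoint subsets."""
    if sum(counts.values()) != n_items:
        raise ValueError("subset sizes must add up to the dataset size")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    subsets, start = {}, 0
    for name, count in counts.items():
        subsets[name] = indices[start:start + count]
        start += count
    return subsets


fractions = split_fractions(SPLIT_COUNTS)
subsets = split_indices(TOTAL_IMAGES, SPLIT_COUNTS, seed=42)
for name, fraction in fractions.items():
    print(f"{name}: {len(subsets[name])} images ({fraction:.1%})")
```

The three counts add up to 10,846, and the exact shares are about 60.8%, 30.3%, and 8.9%, so the 60/30/10 figures are rounded. A class-stratified split would be a reasonable alternative when class sizes differ, although the text does not say whether one was used.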

Table 2 Equipment code, label, and description

CNN model architecture development

To handle the complexities of construction equipment images, we designed a deep CNN architecture inspired by established models like VGG and ResNet (Simonyan & Zisserman, 2015). This architecture leverages multiple convolutional layers for feature extraction. Each convolutional layer uses rectified linear unit (ReLU) activations to introduce non-linearity and improve model performance. Max-pooling layers are strategically inserted between convolutional layers to reduce image dimensionality while preserving key features. The model’s depth is carefully chosen to capture intricate visual details crucial for distinguishing between various construction equipment types. As illustrated in Fig. 3, the network follows a sequential structure:

  1. Convolutional Layers: The process starts with a convolutional layer containing 16 filters of size 3 × 3. This layer extracts low-level features from the input image. Subsequent convolutional layers, with increasing numbers of filters (e.g., 32), progressively extract more complex features.

  2. Max-Pooling Layers: Interspersed between convolutional layers are max-pooling layers. These layers reduce the image size while retaining the most relevant features extracted by the preceding convolutional layers.

  3. Fully Connected Layers: After feature extraction, the process transitions to fully connected layers. The flattened output from the final max-pooling layer is fed into a fully connected layer with 256 neurons and ReLU activation. This layer performs non-linear transformations on the extracted features. Finally, a second fully connected layer with a number of neurons equal to the equipment categories is employed. This layer utilizes the SoftMax activation function to generate probabilities for each equipment class, enabling multi-class classification.

Fig. 3 Detailed architecture of the CNN model
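
To make the sequential structure concrete, the sketch below traces tensor shapes and parameter counts through a network of this kind. Only the 3 × 3 filters, the 16 and 32 filter counts, the 256-neuron dense layer, and the 12 output classes come from the text. The 128 × 128 × 3 input size, the use of exactly two convolution blocks, the absence of padding, and the 2 × 2 pooling window are assumptions made for illustration, so the numbers it prints describe this hypothetical configuration rather than the network in Fig. 3.

```python
def conv_output(shape, filters, kernel=3):
    """3x3 convolution, stride 1, no padding (padding is an assumption)."""
    height, width, channels = shape
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return (height - kernel + 1, width - kernel + 1, filters), params


def pool_output(shape, size=2):
    """Non-overlapping max pooling has no trainable parameters."""
    height, width, channels = shape
    return (height // size, width // size, channels), 0


def dense_output(n_inputs, units):
    """Fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return units, n_inputs * units + units


def trace_network(input_shape=(128, 128, 3), conv_filters=(16, 32),
                  hidden_units=256, n_classes=12):
    """Return (layer name, output shape, parameter count) for each layer in order."""
    layers = []
    shape = input_shape
    for filters in conv_filters:
        shape, params = conv_output(shape, filters)
        layers.append((f"conv3x3-{filters} + ReLU", shape, params))
        shape, params = pool_output(shape)
        layers.append(("max-pool 2x2", shape, params))
    flat = shape[0] * shape[1] * shape[2]
    layers.append(("flatten", (flat,), 0))
    units, params = dense_output(flat, hidden_units)
    layers.append((f"dense-{hidden_units} + ReLU", (units,), params))
    units, params = dense_output(units, n_classes)
    layers.append((f"dense-{n_classes} + softmax", (units,), params))
    return layers


layers = trace_network()
for name, shape, params in layers:
    print(f"{name:<22} output={shape} params={params}")
total_params = sum(params for _, _, params in layers)
print("total parameters:", total_params)
```

Under these assumptions, nearly all trainable parameters sit in the first fully connected layer, because the flattened feature map is large. This is one reason additional convolution and pooling stages are often added before flattening.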

CNN model training and validation

The CNN model was trained using the training dataset, while the validation set was used to evaluate the model’s performance throughout the training process. The Adam optimizer and a categorical cross-entropy loss function were used to reduce the classification error (Liu et al., 2023). During training, accuracy and loss metrics were continuously monitored for both the training and validation datasets, and these metrics guided iterative adjustments to the model. High performance on the validation dataset was taken as evidence of successful training; otherwise, the model architecture was refined, for example by adding batch normalization, dropout layers, or other architectural changes. To prevent overfitting, techniques such as early stopping were employed to halt training when validation accuracy stopped improving or started declining. Additionally, data augmentation was used to artificially increase the size of the training dataset, providing the model with a wider range of examples for each class and thus mitigating overfitting. In instances of underfitting, where the model failed to capture the complexity of the data, the model’s capacity was increased, typically by adding more convolutional layers or neurons. Training was conducted in a Jupyter Notebook on a system equipped with an Intel(R) Core(TM) i7-10510U CPU @ 1.80 GHz, boosting up to 2.30 GHz.
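
The early-stopping rule described above can be sketched in a few lines. The class below is a generic illustration, not the code used in this study: the patience of three epochs, the `min_delta` threshold, and the validation accuracies in `val_history` are hypothetical values.

```python
class EarlyStopping:
    """Signal a stop once validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.stale_epochs = 0

    def update(self, epoch, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.best_epoch = epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Made-up validation accuracies, for illustration only.
val_history = [0.52, 0.61, 0.68, 0.71, 0.70, 0.71, 0.69, 0.68]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, accuracy in enumerate(val_history):
    if stopper.update(epoch, accuracy):
        stopped_at = epoch
        break

print("best epoch:", stopper.best_epoch, "stopped at epoch:", stopped_at)
```

In practice the weights from the best epoch are usually restored after stopping, so that the final model corresponds to the highest validation accuracy rather than to the last epoch trained.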

CNN model testing and evaluation metrics

After successfully completing the training and validation processes, the final CNN model underwent rigorous testing using a separate testing dataset that had not been seen during training or validation. This test dataset was utilized to evaluate the model’s performance in accurately classifying heavy construction equipment images. The effectiveness of the CNN model in real-world scenarios was assessed by analyzing performance metrics, including precision, recall, and F1-score, as defined in Eqs. (3) to (5).

$${\rm{Precision}}\,{\rm{ = }}\,{{{\rm{True}}\,{\rm{Positives}}} \over {{\rm{True}}\,{\rm{Positives}}\,{\rm{ + }}\,{\rm{False}}\,{\rm{Positives}}}}$$
(3)
$${\rm{Recall}}\,{\rm{ = }}\,{{{\rm{True}}\,{\rm{Positives}}} \over {{\rm{True}}\,{\rm{Positives}}\,{\rm{ + }}\,{\rm{False}}\,{\rm{Negatives}}}}$$
(4)
$${\rm{F1}}\,{\rm{ = }}\,{\rm{2}}\,{{{\rm{Precision \times Recall}}} \over {{\rm{Precision + Recall}}}}$$
(5)
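
As a worked illustration of Eqs. (3) to (5), the sketch below derives per-class precision, recall, F1-score, and support from a confusion matrix. The 3 × 3 matrix `toy` is invented for the example and is unrelated to the results reported later; the function name `per_class_metrics` is likewise hypothetical.

```python
def per_class_metrics(confusion):
    """confusion[t][p] = number of samples of true class t predicted as class p."""
    n_classes = len(confusion)
    report = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fp = sum(confusion[t][k] for t in range(n_classes)) - tp  # column total minus TP
        fn = sum(confusion[k]) - tp                               # row total minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report.append({"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn})
    return report


# Made-up 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 9]]

report = per_class_metrics(toy)
for k, row in enumerate(report):
    print(f"class {k}: precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} f1={row['f1']:.2f} support={row['support']}")
```

For class 1 in this toy matrix, 6 of the 8 predictions are correct (precision 0.75) and 6 of the 10 true samples are recovered (recall 0.60), which gives an F1-score of about 0.67.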

Results and discussion

Various CNN architectures with different configurations and hyperparameters were explored during the training phase, and this section discusses the results of the training, validation and testing of the optimal design.

Performance evaluation of CNN model during training stage

Figure 4 shows the accuracy and loss curves of training and validation. The training accuracy curve shows a steady and continuous increase, reflecting effective learning and classification of the training data. The validation accuracy curve also progresses positively, indicating that the model generalizes well to new images. Additionally, the training loss curve, which consistently declines, indicates the model’s successful adaptation to reduce errors. The validation loss curve similarly decreases, suggesting the model’s capability to generalize and make accurate predictions on the validation data. The minimal variation in loss during training, along with its steady convergence to a low value (0.4), suggests that the optimization converged stably toward a minimum of the loss function. Overall, the training and validation curves reveal that the model has effectively learned the complex features of heavy construction equipment, achieving commendable accuracy and low loss metrics on both datasets. These results highlight the model’s potential to accurately classify and identify different types of construction equipment, contributing to enhanced operational efficiency, maintenance, and safety in the construction industry.

Fig. 4 Training accuracy and loss vs. validation accuracy and loss

Performance evaluation of CNN model during testing stage

The classification results presented in Table 3 offer a detailed assessment of the model’s effectiveness in categorizing heavy construction equipment into 12 distinct classes during the testing phase. The precision scores average around 0.80, with a range from 0.71 to 0.87. Notably, the model shows high precision in categories like concrete mixer trucks (0.87), boom lifts (0.86), and telescopic handlers (0.84), indicating its high accuracy in identifying these specific equipment types. The variation in precision scores might be due to differences in visual complexity and distinctiveness among the classes, with equipment having more easily identifiable features achieving higher precision. Moreover, the recall scores, which reflect the model’s ability to correctly identify all relevant instances within a class, range from 0.73 to 0.86. The highest recall score of 0.86 was observed for Class 1 (boom lift), demonstrating the model’s high capability to detect true positives in this category. Conversely, the lower recall rate of 0.73 for Class 9 (pile driving machine) could be due to visual similarities with other equipment types or challenges in correctly identifying all instances.

Table 3 Classification report of testing phase

Furthermore, the F1-score, which balances precision and recall, ranges from 0.75 to 0.86. Class 1 (boom lift) attained the highest F1-score of 0.86, while Class 7 (loader) had the lowest F1-score of 0.75. The lower score for Class 7 suggests an imbalance between precision and recall, possibly due to challenges in accurately distinguishing this class based on visual features alone. In addition, the support values, indicating the number of test instances per class, range from 66 to 92. Classes with higher support are likely to have more training data as well, which may contribute to better classification performance.

Overall, these metrics indicate that the model performs competitively in classifying heavy construction equipment. However, certain challenges persist, particularly in classes with lower precision and recall. Addressing these issues may require refining the model’s feature extraction capabilities and enhancing the training process to improve accuracy and generalization across all equipment categories.

To assess the performance of the CNN model more comprehensively, Receiver Operating Characteristic (ROC) curves were generated and examined. Figure 5 displays the ROC curves for the 12 distinct types of construction equipment. The model exhibits impressive performance, as evidenced by its high Area Under the Curve (AUC) values for all classes. Notably, the model achieves an AUC score of 0.92 for both the concrete mixer machine and telescopic handler (classes 2 and 11), indicating highly accurate classification. Moreover, both the forklift (class 6) and motor grader (class 8) demonstrate high performance, with an AUC value of 0.91.

Fig. 5 ROC curves of equipment classification for testing stage
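
Per-class AUC values such as those in Fig. 5 are commonly obtained in a one-vs-rest fashion, treating each class in turn as positive and all others as negative. The sketch below assumes that approach, which the paper does not explicitly confirm. It uses the rank interpretation of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), and the predicted probabilities and labels are invented for illustration.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, y_true, n_classes):
    """AUC of each class against all other classes, from per-class predicted probabilities."""
    return [
        binary_auc([row[k] for row in probabilities], [label == k for label in y_true])
        for k in range(n_classes)
    ]


# Made-up softmax outputs for six samples and three classes.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],
]
y_true = [0, 0, 1, 1, 2, 2]

aucs = one_vs_rest_auc(probabilities, y_true, n_classes=3)
print(aucs)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a test set of a few hundred images per comparison; library implementations typically use a sorted-rank formulation instead.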

However, the pile driving machine (class 9) exhibits a lower AUC value of 0.83, indicating difficulties in accurately classifying this particular equipment type. The graphical representation in Fig. 5 provides a visual overview of the AUC metrics across different construction equipment categories, offering insights into the classifier’s effectiveness in distinguishing between equipment types based on the testing results. In addition, the precision-recall curves in Fig. 6 support these findings, showing that classes 6 and 8 achieve the best results, with average precision (AP) values of 0.73 and 0.74, respectively. In contrast, class 9 records the lowest precision-recall performance, with an AP score of 0.52.

Fig. 6 Precision-recall curves of equipment classification for testing stage
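
Average precision summarizes a precision-recall curve in a single number. One common definition, assumed here because the paper does not state which variant was used, averages the precision measured at the rank of each true positive when samples are sorted by decreasing score. The scores and labels below are invented for the example.

```python
def average_precision(scores, is_positive):
    """Mean of the precision values measured at the rank of each true positive."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(1 for _, flag in ranked if flag)
    if n_positive == 0:
        raise ValueError("need at least one positive sample")
    hits = 0
    total = 0.0
    for rank, (_, flag) in enumerate(ranked, start=1):
        if flag:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / n_positive


# Made-up one-vs-rest scores for a single class.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_positive = [True, False, True, False, False]

ap = average_precision(scores, is_positive)
print(f"AP = {ap:.3f}")
```

Here the positives appear at ranks 1 and 3, giving precisions of 1 and 2/3 and an AP of about 0.833. Because precision depends on how rare the positive class is, one-vs-rest AP values in a 12-class problem can be noticeably lower than the corresponding AUC values, which may help explain the gap between the two metrics for the same classes.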

In a separate effort, a confusion matrix was constructed during the testing phase to evaluate the CNN model’s performance in classifying construction equipment (Fig. 7). The matrix shows that the model achieved high accuracy in classifying concrete mixer machines, scissor lifts, concrete mixer trucks, and forklifts (classes 2, 10, 3, and 6, respectively). However, there is room for improvement in accurately classifying asphalt rollers, telescopic handlers, excavators, and boom lifts (classes 0, 11, 5, and 1, respectively). To enhance the model’s performance, applying data augmentation techniques and fine-tuning the model’s hyperparameters could be beneficial. Additionally, a more in-depth analysis of the misclassified images, especially those with lower accuracy, may provide valuable insights for improving the model’s ability to distinguish between specific types of equipment.

Fig. 7 Confusion matrix of equipment classification for testing stage
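
The analysis of misclassified images suggested above can start from the confusion matrix itself. The sketch below builds a confusion matrix from true and predicted labels and lists the largest off-diagonal cells, that is, the class pairs confused most often. The labels are invented three-class data, and the helper `top_confusions` is a hypothetical name introduced for this example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def top_confusions(matrix, k=3):
    """Return the k largest off-diagonal cells as (count, true_class, predicted_class)."""
    n_classes = len(matrix)
    cells = [
        (matrix[t][p], t, p)
        for t in range(n_classes)
        for p in range(n_classes)
        if t != p and matrix[t][p] > 0
    ]
    return sorted(cells, reverse=True)[:k]


# Made-up labels for ten test samples and three classes.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 2, 2, 0, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
worst = top_confusions(cm, k=2)
print(cm)
print(worst)
```

Applied to the 12-class test results, a ranking of this kind would point directly to the equipment pairs whose images are worth inspecting first.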

Conclusions

This paper introduces a CNN model developed specifically to tackle the challenge of accurately classifying heavy construction equipment on construction sites. This model represents a significant advancement in the identification and categorization of various heavy equipment in the construction industry. The developed CNN model demonstrates remarkable accuracy, with precision scores ranging from 0.71 to 0.87 and recall values ranging from 0.73 to 0.86 across various equipment classes. These findings highlight the model’s effectiveness in accurately distinguishing different types of construction machinery.

The real-world implications of adopting this CNN model are substantial. The model contributes to optimizing operational efficiency and logistics in construction projects by automating the identification and categorization of equipment on construction sites. This results in enhanced resource allocation and more efficient equipment tracking, which leads to improved project execution. Furthermore, the CNN model enhances safety protocols on construction sites by providing a robust system for detailed equipment monitoring. This facilitates improved recognition of potential hazards and focused actions for risk mitigation, fostering a safer working environment for construction personnel. Additionally, the model’s streamlined operations and improved maintenance practices contribute to cost reductions.

It is crucial to acknowledge that the developed CNN model has some limitations. The model classifies only a specific set of equipment found on construction sites. Moreover, there is a potential imbalance in the training dataset due to data limitations. Addressing these limitations through future research could unlock the model’s full potential. Promising avenues for future exploration include integrating real-time data streams for continuous monitoring and adaptation, utilizing transfer learning techniques to expand applicability to a broader range of equipment categories, and investigating advanced image augmentation techniques to mitigate potential dataset biases and improve the model’s overall robustness.