Introduction

Object detection in remote sensing images is an essential image-processing step for a range of applications in industry, agriculture, and the military (d’Acremont et al., 2019). Identifying land uses and objects in remote sensing images acquired by satellites is essential for regulating and tracking activities on the ground (Lin & Wu, 2019). Recently, owing to the massive quantity of data provided by remote sensing imagery, deep learning has become crucial in remote sensing applications such as object segmentation, target identification, object recognition, image augmentation, and image preprocessing (Liu et al., 2022). Various deep convolutional neural network (CNN) models have been developed and utilized in the realm of satellite imagery; their different architectures extract different deep characteristics and produce varying experimental outcomes (Ran et al., 2019). CNN’s current learning capability is largely due to its feature extraction stages, which learn deep representations from the dataset, so deploying CNN algorithms in remote sensing object detection substantially enhances system accuracy (Khan et al., 2020). To get the maximum benefit from CNN-based processing of remote sensing images in the applications mentioned above, the applied CNN models should achieve the highest possible accuracy and be able to extract the features of very small objects accurately. This article develops a new pyramidal CNN model that improves object detection in remote sensing images while keeping the training time very small in comparison with existing CNN models. The suggested CNN architecture was compared with nine different pre-trained convolutional models and outperformed them. The suggested structure includes three main stages, illustrated in Fig. 1: (i) gathering and preparation of datasets, (ii) development of the CNN architecture, and (iii) evaluation, examination, and comparison of the proposed CNN model against several pre-trained models. This article’s major contributions are:

  1. Proposing a robust CNN model built on an optimized layering structure and fine-tuned hyper-parameters, advancing the state of the art.

  2. Evaluating the impact of traditional pre-trained CNN models on the classification of objects in remote sensing images.

  3. Comparing the performance of the suggested CNN algorithm with nine well-known pre-trained models on standardized datasets.

Fig. 1 Main stages of the CNN system

The remainder of this paper is structured as follows: Section "Literature Review" summarizes the related literature. Section "Approach Preprocessing" describes the preprocessing steps in depth. Section "The Proposed CNN Model Approach" presents the suggested method. Section "Experimental Results and Discussion" reports and discusses the experimental outcomes, and the "Conclusion" section brings the research to a close.

Literature Review

Classification

Kumar et al. (2021) examined the outcome of pre-training 16 different convolutional neural network algorithms on the ImageNet database and tuned these models for the challenge of recognizing numerous items in very high-resolution pictures. They indicated that using pre-trained algorithms would reduce the demand for vast volumes of very-high-resolution pictures.

Li et al. (2019) used four DNNs (CNN, SMDTR-CNN, CapsNet, and SMDTR-CapsNet) to construct comparable classification approaches for metropolitan built-up environments. The performance of the offered methodologies was confirmed across various metrics.

Liang et al. (2020) utilized a two-stream satellite image categorization system. The merging of CNN and GCN lets the developed system simultaneously learn object-based spatial aspects and global visual characteristics. The framework acquires the appearance properties of the entire picture and the spatial dependence between items at the same time, thereby reducing visual confusion and improving feature discrimination.

Xu et al. (2021) proposed an improved classification approach for land categorization from remote sensing pictures that combines a recurrent neural network (RNN) and a random forest (RF). Both object and pixel categorization are used for classification.

Cheng et al. (2020) described the primary issues in satellite image categorization and presented: (i) auto-encoder-based satellite image classification; (ii) CNN-based satellite image object detection methods; and (iii) methods for detecting objects in satellite photographs using generative adversarial networks.

Dong et al. (2020) developed an approach for classifying very-high-resolution satellite pictures based on merging a random forest (RF) classifier with a CNN. The fusion with the RF greatly enhanced the selection of relevant variables.

Ma et al. (2021) presented the SceneNet approach for discovering image categorization network architectures based on multi-objective neuroevolution. The architecture coding and system search in SceneNet are accomplished with an evolutionary technique, which improves the hierarchical extraction of satellite image information.

Priya and Vani (2019) proposed a convolutional neural algorithm for fire detection. The algorithm is based on Inception-v3 with transfer learning and has been trained on satellite pictures to classify images into fire and non-fire classes.

Unnikrishnan et al. (2019) proposed innovative deep learning designs for three different networks (AlexNet, VGG, and ConvNet), developed by hyper-tuning each model and using two bands of data as the input. The redesigned models, with the two-band input and a reduced number of layers, are trained and tested to categorize photographs into distinct groups.

Rohith and Kumar (2022) built a 13-layer CNN architecture for the process of classifying raw remote sensing images of the National Remote Sensing Center (NRSC) dataset.

Zhao et al. (2019) investigated the feature representation capacity of multiple classifiers from the perspective of categorization of satellite imagery. Furthermore, 4 pre-trained CNN algorithms and 3 popular databases are chosen, compared, and summarized.

Özyurt (2020) employed the pre-trained VGG19, VGG16, AlexNet, ResNet, SqueezeNet, and GoogleNet architectures as feature extractors. Features are acquired from each architecture’s final fully connected (FC) layers, and the ReliefF approach is used to select suitable features. The convolutional features are then fed into a support vector machine (SVM) classifier rather than the CNN’s FC layers to measure performance.

Detection

Zalpour et al. (2020) proposed an oil tank identification framework based on oil depot detection using deep characteristics. First, oil depots are retrieved using a Faster R-CNN. Second, a fast circle detection algorithm is used for suitable target selection. For feature extraction, they coupled CNN and HOG features. Finally, an SVM classifier is utilized to classify the images.

Chen et al. (2022) used the transfer learning approach to overcome the overfitting issue. For finding airplanes in remote sensing photographs, the Domain Adaptation Faster R-CNN (DA Faster R-CNN) algorithm is suggested. The DA Faster R-CNN detection technique is applied to the DOTA dataset to detect aircraft, addressing the detection challenge posed by the poor quality of remote sensing photos.

Darehnaei et al. (2022) suggested swarm intelligence ensemble deep transfer learning (SI-EDTL) for detecting various vehicles. Faster region-based convolutional neural networks (Faster R-CNN) are employed in this article: the region proposal network (RPN) extracts regional proposals, and the CNN is then employed to choose the most evocative characteristics of each region to identify objects. They utilized three different Faster R-CNNs pre-trained on the ImageNet dataset, along with five transfer classification models, to categorize the regions of interest into four vehicle classes taken from the UAV dataset.

Feng et al. (2019) utilized Faster R-CNN to locate vehicles in satellite pictures. They investigated the effects of object size and the pooling technique on the region proposal, and then developed a new region recommendation strategy to improve detection accuracy for multi-scale objects. Several tests demonstrate the efficiency of the developed multi-scale object detection algorithm for remote sensing images.

Napiorkowska et al. (2018) utilized the FCN-VGG network to recognize three distinct items in satellite imagery, namely roadways, palm plants, and vehicles, taken from the Deimos-2 and Worldview-3 datasets. The outcomes indicate that the suggested strategy succeeds at locating objects with various colors and forms, which conventional approaches such as RF or SVM cannot do.

Karnick et al. (2022) applied a Multi-Scale Swift Detection System, which is a fully convolutional network architecture, on the COWC dataset to locate cars. This detection technique employs a modified version of YOLO known as the YOLT architecture, which scans test photographs of arbitrary size using bounding boxes to find the vehicles.

Karim et al. (2019) proposed a training approach that relies on compressed and down-scaled photographs to assess the influence of compression and down-scaling techniques on vehicle prediction performance.

Sharma et al. (2021) built YOLOrs, a novel CNN proposed for object recognition in multimodal remote sensing pictures. The approach, applied to vehicle detection, was compared with various modern techniques to demonstrate its strengths.

Zhang et al. (2022) proposed an MFRC detection technique based on Faster R-CNN for the detection of airplanes. The suggested framework was developed in three steps: first, K-means is used to cluster the airplane enclosing regions while also enhancing region detection in the RPN. Second, the pooling layers of the VGG16 network are reduced from four to two in order to retain the characteristics of small-scale airplanes. Finally, Soft-NMS is employed to refine the airplane’s bounding frame.

Zhu et al. (2020) proposed an innovative satellite images object recognition technique that employs a fusion-based feature reinforcement component (FB-FRC) to enhance object feature discrimination. Two fusion techniques are proposed in detail: (i) a hard fusing approach and (ii) a soft fusing technique.

Cui et al. (2021) introduced a detection mechanism based on the dual-channel deep learning (DCDL) method, which collects global and local characteristics simultaneously. In the first stage, a multiscale convolution residual network performs local mining and residual calculations on the image. Second, a local concentration approach limits the information by assigning weight factors to local attributes. Lastly, a 2-layer convolution performs deep feature mining in order to detect three separate classes from the NWPU-RESISC45 dataset.

Segmentation

Diakogiannis et al. (2020) introduced ResUNet-a, a powerful deep learning approach for the segmentation of high-definition satellite photographs; their architecture is built on the encoder/decoder concept, with typical convolutions substituted with ResNet modules.

Pang and Gao (2022) introduced the MAGC-Net neural network for pixel-level classification of ocean satellite imagery, which is built on a multi-head attention technique supervised by Conv-LSTM. The outcomes demonstrate that the three Conv-LSTM layers that analyze deep features in this network use the multi-head attention technique to exploit the extracted information fully, substantially reducing redundant features and enhancing feature fusion.

Table 1 summarizes the most recent research articles that discuss the problem of recognizing multiple objects in remote sensing images.

Table 1 A complete literature review of recent research articles on the state of the art of remote sensing image classification

The literature review shows that remote sensing applications can be categorized in several ways:

  1. Application (classification, detection, and segmentation).

  2. Acquired image quality (high-resolution images, low-resolution images).

  3. Objects (multiple-object detection, single-object detection).

  4. Image capturing distance from the ground (satellite imagery, drone imagery).

In this article, we are utilizing low-resolution multiple-object satellite images for classification purposes.

Approach Preprocessing

Image Acquisition

Images are collected from two datasets, which are summarized in Table 2: (a) Northwestern Polytechnical University (NWPU) published the NWPU-RESISC45 dataset for Remote Sensing Image Scene Classification (RESISC). This collection comprises 31,500 photographs divided into 45 scene classes, each with 700 images (Cheng et al., 2017). Dataset samples are illustrated in Fig. 2. (b) The UC Merced Land Use dataset contains 21 land-use classes represented by aerial photographs (256 × 256 RGB), with 100 images per class (Yang & Newsam, 2010). Figure 3 illustrates samples from the dataset.

Table 2 The dataset details
Fig. 2 NWPU-RESISC45 dataset samples

Fig. 3 UC Merced dataset samples

Image Preprocessing

Image preprocessing is a broad field of study that encompasses many techniques, including image resizing and image augmentation, which play a major role in deep learning applications and especially in object detection tasks (Kodali & Dhanekula, 2021; Marastoni et al., 2021). The proposed preprocessing steps are summarized in Fig. 4.

Fig. 4 Preprocessing flow chart

Image Resizing

Image resizing is an important image preprocessing step in which the dimensions of input images are adjusted to better suit the CNN architecture the system is dealing with; it can also improve overall accuracy and processing time (Kodali & Dhanekula, 2021; Pathak & Raju, 2022). This article resizes all input images to the predefined input size of each deep learning pre-trained model involved, using the standard Python resizing function adopted in recent articles (Vyas et al., 2022).
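As a concrete illustration, the following is a minimal sketch of this step, assuming images are loaded from disk with Pillow; the file path and target size are hypothetical examples (the actual size depends on the model, e.g. 128 × 128 for the proposed model).

```python
# Minimal resizing sketch: load an image and rescale it to the input size
# the chosen CNN expects. Path and sizes are illustrative assumptions.
from PIL import Image

def resize_image(path, target_size=(128, 128)):
    """Load an RGB image and resize it to the CNN's expected input size."""
    img = Image.open(path).convert("RGB")
    return img.resize(target_size, Image.BILINEAR)

resized = resize_image("nwpu/airplane/airplane_001.jpg")  # hypothetical path
```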

Image Augmentation

Image augmentation is an approach applied to expand the volume of data by adding slightly changed replicas of existing data or newly produced synthetic data derived from it. It functions as a regularizer and helps prevent overfitting while developing a deep learning algorithm (Chlap et al., 2021; Khalifa et al., 2022). In this article, data augmentation increases both the training and validation data by adding modified versions of the data: rotation with an angle between 0 and 180 degrees, vertical flip, horizontal flip, and zoom with a range of 0.1 of the overall input image dimensions.
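A sketch of these augmentations, assuming the Keras ImageDataGenerator API (the article states the Keras library is used for the models, so this is a plausible but not confirmed implementation):

```python
# Augmentation sketch matching the transformations described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=180,    # random rotations between 0 and 180 degrees
    vertical_flip=True,    # random vertical flips
    horizontal_flip=True,  # random horizontal flips
    zoom_range=0.1,        # zoom by up to 0.1 of the image dimensions
)
```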

Model Optimization

Optimizers are techniques that adjust the deep learning algorithm’s properties, including the learning rate (Lr) and weights, to increase the accuracy of the system; they are critical in decreasing the loss during the training phase (Manickam et al., 2021). This article uses the Adam optimization technique, starting with random weights and a 0.00025 initial learning rate (Kingma & Ba, 2014). Furthermore, the learning rate is reduced by a factor of 0.25 using a learning-rate reduction method whenever accuracy stays constant for several consecutive epochs, with a maximum of four reductions overall and a minimum learning rate of 1 × 10−6.
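A sketch of this setup using the standard Keras Adam optimizer and ReduceLROnPlateau callback; monitoring validation accuracy and a patience of three epochs are assumptions, since the text only specifies the factor, the cap of four reductions, and the floor (which 0.00025 × 0.25^4 ≈ 9.8 × 10−7 < 10−6 enforces automatically):

```python
# Optimizer and learning-rate-reduction sketch for the described schedule.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

optimizer = Adam(learning_rate=2.5e-4)   # 0.00025 starting learning rate
lr_reduction = ReduceLROnPlateau(
    monitor="val_accuracy",  # assumed monitored quantity
    factor=0.25,             # new_lr = lr * 0.25
    patience=3,              # assumed epochs of stagnation before reducing
    min_lr=1e-6,             # floor; also caps the number of reductions at 4
)
```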

Pre-trained CNN Algorithms

CNN has already made incredible progress, mostly in image processing techniques, and has rekindled academic interest in ANNs. Numerous research papers have sought to improve CNNs’ ability to complete tasks; CNN advancement can be divided into several categories, such as optimization, regularization, deep learning architectures, and design improvements (Lei et al., 2020). The following paragraphs review the most prevalent convolutional networks. This article applied multiple pre-trained deep learning techniques in order to compare their accuracy with the model introduced in this article, including:

  • The VGG16 algorithm is a deep learning network composed of thirteen convolutional layers combined with three fully connected layers. It is divided into 41 components, including the SoftMax layer, max-pooling layers, fully connected layers, ReLU layers, and dropout layers (Ye et al., 2021). The VGG16 default input image dimensions are (224, 224, 3).

  • The VGG19 algorithm is a deep learning network composed of sixteen convolutional layers combined with three fully connected layers. It comprises 47 parts under the same counting, including max-pooling layers, fully connected layers, ReLU layers, dropout layers, and the SoftMax layer (Li et al., 2020). The VGG19 default input image dimensions are (224, 224, 3).

  • AlexNet was created using deep learning methods; this design reduced error rates in computer image classification. Five convolutional layers, three pooling layers, and three fully connected layers are the main layers of AlexNet (Dhillon & Verma, 2020). The AlexNet default input image size is (224, 224, 3). The model is summarized in Eq. (1).

    $$A\left( M \right) = I\left( M \right) + X\left( M \right)$$
    (1)

    where the output target I(M) and the summed companion target X(M) are calculated independently in Eqs. (2) and (3).

    $$I\left( M \right) \equiv \left\| {m^{\left( o \right)} } \right\|^{2} + L\left( {M,\;m^{\left( o \right)} } \right)$$
    (2)
    $$X\left( M \right) \equiv \mathop \sum \limits_{n = 1}^{N - 1} \left[ {\left\| {m^{\left( n \right)} } \right\|^{2} + L\left( {M,\;m^{\left( n \right)} } \right) - r} \right]$$
    (3)
  • MobileNet is composed of depth-wise separable convolution layers, each made up of a depth-wise convolution and a point-wise convolution. Counting point-wise and depth-wise convolutions separately, a MobileNet has 28 layers; a basic MobileNet contains 4.2 million parameters (Hou et al., 2020). The MobileNet default input image size is (224, 224, 3). The depth-wise convolution is represented in Eq. (4); a NumPy sketch of this operation appears after this list.

    $$\hat{G}_{k,l,m} = \mathop \sum \limits_{i,j} \hat{K}_{i,j,m} \cdot F_{k + i - 1,\;l + j - 1,\;m}$$
    (4)

    where \(\hat{K}\) is the depth-wise convolution kernel, F is the input feature map, and \(\hat{G}\) is the output feature map.

  • ResNet is a deeper design than other known architectures, with up to 152 layers, and is made up of several residual blocks. The ResNet default input dimensions are (224, 224, 3). Equations (5)–(7) represent the ResNet model (Sarwinda et al., 2021).

    $$S_{1 + n}^{j} = A_{c} (S_{1 \to n}^{j} ,\;j_{1 \to n} ) + S_{I}^{j} ,\quad n \ge I$$
    (5)
    $$S_{1 + n}^{j} = A_{a} (S_{1 + n}^{j} )$$
    (6)
    $$A_{c} \;(S_{1 \to n}^{j} ,\;j_{1 \to n} ) = S_{1 + n}^{j} - S_{I}^{j}$$
    (7)

    where \(A_{c}\)(\(S_{1 \to n}^{j}\), \(j_{1 \to n}\)) is the transformed signal and \(S_{I}^{j}\) is the input of the I-th layer; \(S_{1 + n}^{j}\) becomes the input of the next layer after the activation function \(A_{a}\) is applied.

  • DenseNet: a traditional n-layer deep network has n connections, one between each layer and the layer after it. DenseNet contains n(n + 1)/2 interconnections, since each layer links to all subsequent layers in a feed-forward way. The feature maps of all preceding layers are used as inputs to every layer, and its own feature maps serve as inputs to all following layers. The DenseNet default input dimensions are (224, 224, 3) (Zhai et al., 2020). DenseNet improves the network by concatenating all of the feature maps successively rather than summing the feature maps of all preceding layers, as shown in Eq. (8), where the layer index is denoted by n, the nonlinear operation by G, and the features of the nth layer by f. A sketch of this connectivity pattern appears after this list.

    $${f}_{n}={G}_{n}(\left[{f}_{0},{f}_{1},{f}_{2}, \dots \dots ,{f}_{n-1}\right])$$
    (8)
  • LeNet is among the first publicly available CNNs to gain widespread recognition for effectiveness on computer vision tasks. LeNet is composed of two parts: (a) a convolutional encoder with two convolutional layers and (b) a dense network with three FC layers. The LeNet default input dimensions are (32, 32, 3). Equation (9) shows how the output is estimated in the LeNet model (Bouti et al., 2020), where \({y}_{n}\) is the output layer, \({x}_{m}\) is the input vector, and \({\varnothing }_{nm}\) is the weight vector.

    $${y}_{n}=\sum {({x}_{m}-{\varnothing }_{nm})}^{2}$$
    (9)
  • Xception is a 71-layer convolutional neural model used in deep learning applications. The input is routed through the entry flow, then through the middle flow, which is repeated eight times, and finally through the exit flow (Jie et al., 2020). The Xception model default input dimensions are (299, 299, 3).

  • Inception-V3 is composed of asymmetric and symmetric building blocks, containing convolutional layers, pooling layers, concatenations, dropouts, and fully connected layers. Because of the inception modules inside its structure, it has a complicated architecture (Kumthekar & Reddy, 2021). The Inception model default input dimensions are (299, 299, 3).
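Two of the building blocks above lend themselves to short illustrations. First, a minimal NumPy sketch of the depth-wise convolution in Eq. (4): one filter per input channel, applied to that channel only, with no cross-channel mixing (padding and strides omitted for brevity):

```python
# Depth-wise convolution sketch implementing Eq. (4) directly.
import numpy as np

def depthwise_conv(F, K):
    """F: input feature map (H, W, M); K: depth-wise kernels (k, k, M)."""
    k, _, M = K.shape
    H, W, _ = F.shape
    out = np.zeros((H - k + 1, W - k + 1, M))
    for m in range(M):                      # each channel has its own filter
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x, m] = np.sum(K[:, :, m] * F[y:y + k, x:x + k, m])
    return out
```

Second, a sketch of the dense connectivity of Eq. (8), in which every layer receives the concatenation of all earlier feature maps; the layer count and filter sizes here are arbitrary illustration values, not the DenseNet201 configuration:

```python
# Dense-block sketch: each layer consumes the concatenation of all
# previous feature maps, as in Eq. (8).
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(224, 224, 3))
features = [inputs]
x = inputs
for n in range(3):  # three layers, purely for illustration
    x = layers.Concatenate()(features) if len(features) > 1 else features[0]
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)  # G_n
    features.append(x)  # f_n feeds every later layer
model = Model(inputs, x)
```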

The Proposed CNN Model Approach

To achieve a CNN-based deep learning algorithm with enhanced performance and minimal loss, several parameters must be chosen carefully for a given dataset: the optimization algorithm, layering structure, filter sizes, batch size, activation function, learning rate, and number of filters. The utilized CNN model was built from several max-pooling, convolutional, batch-normalization, and dense layers. Combining, pairing, and layering are employed to construct a method that outperforms the known pre-trained models. Figures 5 and 6 illustrate the full layout of the suggested CNN algorithm. The proposed CNN structure starts with an input layer with an input image dimension of (128, 128, 3); after the input layer, there are five convolutional layers combined with two max-pooling and two batch-normalization layers. All the convolutional layers use "same" padding and the ReLU activation function. The layers are as follows (a minimal Keras sketch of the architecture appears after the list):

  1. Convolutional layer with a kernel size of 3 × 3 and 128 filters.

  2. Convolutional layer with a kernel size of 3 × 3 and 256 filters.

  3. Max-pooling layer with a kernel size of 4 × 4.

  4. Batch normalization.

  5. Convolutional layer with a kernel size of 3 × 3 and 128 filters.

  6. Convolutional layer with a kernel size of 3 × 3 and 256 filters.

  7. Convolutional layer with a kernel size of 3 × 3 and 512 filters.

  8. Max-pooling layer with a kernel size of 4 × 4.

  9. Batch normalization.

  10. Flatten layer with a size of 2048 units.

  11. Three fully connected (dense) layers with the ReLU activation function, of 1024, 512, and 256 units, in that order.

  12. Batch normalization.

  13. A dense layer of 10 units with the SoftMax activation function, acting as the output layer (classifier).
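A minimal Keras sketch of this layer sequence (a sketch consistent with the list above, not the authors’ released code; the compile settings follow the loss and optimizer stated below):

```python
# Sketch of the proposed pyramidal CNN as enumerated above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=4),
    layers.BatchNormalization(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.Conv2D(512, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=4),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(10, activation="softmax"),  # one unit per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```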

Fig. 5 The full layout of the proposed CNN architecture

Fig. 6 The summary structure of the proposed CNN architecture

The rest of the parameters are listed in the results Section "Parameters". The number of nodes in the SoftMax layer matches the number of classes the suggested method supports. The loss of all the models in this article is measured with sparse categorical cross-entropy, and Adam is used as the optimizer. The suggested CNN model is based on the pyramid shape, as shown in Fig. 5, and aims to enhance object classification in satellite imagery so that it outperforms the traditional pre-trained deep learning algorithms, concentrating on accuracy and training time. The proposed method is named Pyramidal Net due to its pyramidal shape. The contribution of the proposed model can be attributed to its unique and optimized layering structure, including the selection of the number of filters, the kernel size, and the optimization technique. The expanding-and-contracting pattern in the number of filters improves feature extraction, which increases the overall performance of the system.

Experimental Results and Discussion

Experiments Dataset Description

The first employed dataset includes 10 classes from the NWPU-RESISC45 dataset: airplane, baseball court, desert, beach, overpass, roundabout, forest, stadium, harbor, and lake. Each of the ten classes consists of 700 different images. Furthermore, in order to verify the experimental findings, the UC Merced Land Use dataset has been utilized. The second employed dataset consists of 10 classes from the UC Merced Land Use dataset: airplane, baseball court, chaparral, beach, overpass, parking lot, forest, tennis court, harbor, and agricultural. Each of the ten classes consists of 100 different images. The 10 classes in both datasets were chosen randomly to evaluate the model’s performance metrics and compare the proposed model with the pre-trained models. Both datasets are divided into a 70% training set, a 15% validation set, and a 15% test set, distributed equally over the 10 classes; for the NWPU-RESISC45 subset this corresponds to 4900 training, 1050 validation, and 1050 test images (a sketch of this split follows).
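A sketch of the 70/15/15 split using scikit-learn, assuming the images and labels have already been loaded into NumPy arrays; stratification keeps the classes equally distributed, as the text requires:

```python
# 70/15/15 stratified split sketch; `images` and `labels` are assumed
# to be pre-loaded NumPy arrays for the 10 chosen classes.
from sklearn.model_selection import train_test_split

train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=0)
```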

Performance Metrics

The operational efficiency of object detection from remote sensing images is measured by assessing accuracy, processing time, and degree of complexity. Researchers can evaluate how parameter changes impact the model’s performance during training by exploring deep learning approaches. True-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) counts are computed (Bouguettaya et al., 2022; Singh et al., 2022). From these, the following evaluation metrics have been calculated (a scikit-learn sketch follows the list):

  1. Accuracy is measured by the number of correctly detected instances: the total number of correctly classified objects divided by the total number of classifications performed by the algorithm.

    $${\text{Accuracy}} = \frac{{{\text{TN}} + {\text{TP}}}}{{{\text{TN}} + {\text{TP}} + {\text{FN}} + {\text{FP}}}}$$
    (10)

  2. Precision is the number of true-positive classifications divided by all of the algorithm’s positive classifications.

    $${\text{precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
    (11)

  3. Recall is the proportion of true positively categorized samples to all actually positive samples.

    $${\text{recall}} = \frac{{{\text{TP}}}}{{{\text{FN}} + {\text{TP}}}}$$
    (12)

  4. The F1 score is the harmonic mean of precision and recall.

    $${\text{F}}1 - {\text{score}} = 2\; \times \;\frac{{{\text{precision}}\; \times \; {\text{recall}}}}{{{\text{precision}}\; + \;{\text{recall}}}}$$
    (13)

  5. Intersection over union (IOU), also known as the Jaccard index, measures similarity: the ratio of true positives to the union of predicted and actual positives.

    $${\text{IOU}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
    (14)
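A sketch of computing these metrics from model predictions with scikit-learn; jaccard_score corresponds to the IOU in Eq. (14), and macro averaging over the 10 classes is an assumption:

```python
# Evaluation sketch: the five metrics above, macro-averaged over classes.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, jaccard_score)

y_pred = model.predict(test_x).argmax(axis=1)   # predicted class indices
print("accuracy :", accuracy_score(test_y, y_pred))
print("precision:", precision_score(test_y, y_pred, average="macro"))
print("recall   :", recall_score(test_y, y_pred, average="macro"))
print("F1-score :", f1_score(test_y, y_pred, average="macro"))
print("IOU      :", jaccard_score(test_y, y_pred, average="macro"))
```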

Parameters

All the results were computed based on the parameters described in Table 3. Furthermore, as discussed earlier in the article, image augmentation has been utilized in all experiments in order to overcome the overfitting problem. Table 4 describes the augmentation techniques applied to the input data. The number of epochs is fixed at 30 for all the utilized models to minimize the training time.

Table 3 Predefined parameters are assigned to the proposed system
Table 4 Augmentation techniques

Selection of Optimization Parameters

This section explains the reasoning behind the article’s optimization-parameter selections and investigates their impact on the accuracy of the proposed deep learning model for classifying objects in remote sensing imagery. Choosing the Adam optimizer was one of the most important decisions made within the article: in comparison with the other tested optimization algorithms, Adam frequently converges faster and produces competitive performance with minimal hyper-parameter adjustment. The Adam optimizer was chosen based on both theoretical reasons and empirical tests.

A series of tests with varying learning rates and batch sizes was run to determine how the Adam optimizer affected the accuracy of the proposed object classification model. The proposed model was trained with initial learning rates ranging from 0.001 to 1 × 10−7. Learning rates well above the optimal range resulted in unstable convergence and overfitting, whereas extremely low learning rates slowed convergence without appreciably improving accuracy. A starting learning rate of 0.00025 produced the best compromise between convergence speed and accuracy. The batch size used during training is another important optimization component. Larger batch sizes frequently yield faster convergence, but they also raise the risk of overshooting the optimal solution or becoming trapped in poor local minima. Batch sizes of 16, 32, 50, and 64 were tested; while the bigger batch sizes accelerated convergence, they occasionally showed evidence of overfitting. A batch size of 50 provided a good balance between convergence speed and generalization. A sketch of this sweep follows.
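A sketch of such a sweep (hypothetical scaffolding: build_model() stands for the architecture sketched earlier, and short runs are used for screening; the article does not specify the exact search procedure):

```python
# Learning-rate / batch-size screening sketch over the values discussed.
from tensorflow.keras.optimizers import Adam

best_setting, best_acc = None, 0.0
for lr in [1e-3, 2.5e-4, 1e-5, 1e-7]:          # sampled from the tested range
    for batch in [16, 32, 50, 64]:             # batch sizes from the text
        m = build_model()                      # hypothetical model factory
        m.compile(optimizer=Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
        hist = m.fit(train_x, train_y, batch_size=batch, epochs=5,
                     validation_data=(val_x, val_y), verbose=0)
        acc = max(hist.history["val_accuracy"])
        if acc > best_acc:
            best_setting, best_acc = (lr, batch), acc
```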

Experiment

In this article, a novel CNN model has been utilized and compared with nine different pre-trained models with respect to several performance indicators (accuracy, recall, precision, IOU, and F1-score) for object detection on remote sensing images. The experiment consists of three main stages: image resizing and augmentation; training of nine pre-trained convolutional deep learning algorithms (VGG16, VGG19, AlexNet, DenseNet201, ResNet152V2, LeNet5, MobileNet, Xception, and InceptionV3) plus the proposed method; and finally testing and comparison. The image input size of each CNN algorithm is listed in Table 5. The article applied the default input sizes of each model as provided by the Keras library. For the proposed model, an input size of 224 × 224 × 3 was also tested; it did not change performance and increased the training time, whereas one objective of the proposed model is to maximize accuracy with a small training time in comparison with the other models. Algorithm 1 presents the full processes of the three stages of the object detection pipeline for the proposed model.

Table 5 Input image size of the proposed CNN algorithms
Algorithm 1 The proposed object detection technique

Table 6 presents a complete comparison between the proposed model and the most recent pre-trained deep structures with respect to test loss, test accuracy, train loss, train accuracy, validation loss, and validation accuracy, under the same number of epochs. Moreover, the confusion matrices shown in Figs. 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 and the accuracy and loss curves in Figs. 17, 18, 19, 20, 21, 22, 23, 24, 25 and 26 are provided. Accordingly, the proposed methodology has the best performance among all the tested pre-trained CNN models. Although some pre-trained algorithms, such as the Xception, InceptionV3, and VGG16 CNN models, score close to the proposed algorithm, the proposed model’s training time was the smallest among all the models.

Table 6 Comparison between different pre-trained CNN algorithms and the proposed model on the NWPU-RESISC45 dataset
Fig. 7 VGG16 confusion matrix

Fig. 8 VGG19 confusion matrix

Fig. 9 AlexNet confusion matrix

Fig. 10 MobileNet confusion matrix

Fig. 11 ResNet152V2 confusion matrix

Fig. 12 DenseNet201 confusion matrix

Fig. 13 LeNet confusion matrix

Fig. 14 Xception confusion matrix

Fig. 15 Inception confusion matrix

Fig. 16 Proposed model confusion matrix

Fig. 17 A VGG16 loss curve, B VGG16 accuracy curve

Fig. 18 A VGG19 loss curve, B VGG19 accuracy curve

Fig. 19 A AlexNet loss curve, B AlexNet accuracy curve

Fig. 20 A MobileNet loss curve, B MobileNet accuracy curve

Fig. 21 A ResNet152V2 loss curve, B ResNet152V2 accuracy curve

Fig. 22 A DenseNet201 loss curve, B DenseNet201 accuracy curve

Fig. 23 A LeNet loss curve, B LeNet accuracy curve

Fig. 24 A Xception loss curve, B Xception accuracy curve

Fig. 25 A Inception loss curve, B Inception accuracy curve

Fig. 26 A Proposed model loss curve, B proposed model accuracy curve

In order to confirm the experimental findings, both the proposed pyramidal CNN algorithm and the pre-trained algorithms have also been applied to the UC Merced Land Use dataset. The experimental results showed that the suggested pyramidal model outperforms all other CNN models evaluated. Furthermore, the pyramidal model performs well on small datasets, unlike most of the pre-trained models, which exhibit an overfitting problem, as represented in Table 7.

Table 7 Comparison between different pre-trained CNN algorithms and the proposed model on the UC Merced Land Use dataset

In order to guarantee that the proposed model is size-invariant and able to classify objects of different sizes, an additional experiment that breaks down accuracy by the size of the output classes has been carried out, using three classes from the NWPU-RESISC45 dataset: a small-sized object (airplane), a medium-sized object (stadium), and a large-sized object (desert), each with 700 images. These 2100 images are then split into a 70% training set, a 15% validation set, and a 15% test set. In this experiment, to avoid overfitting, the filter count of every convolutional layer in the proposed model is divided by two (a sketch of this reduced variant follows). Tables 8 and 9 show the performance metrics of the proposed model in this experiment. Finally, Fig. 27 shows the accuracy and loss curves of the experiment, while Fig. 28 shows the confusion matrix of the three classes.
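A sketch of the reduced-width variant used here: the architecture sketched earlier with every convolutional filter count halved and the output layer shrunk to three units; the divisor parameter is a hypothetical convenience, not the authors’ code:

```python
# Reduced proposed model: convolutional filter counts divided by two
# (128->64, 256->128, 512->256) and a 3-way SoftMax output.
from tensorflow.keras import layers, models

def build_reduced_model(num_classes=3, divisor=2):
    return models.Sequential([
        layers.Input(shape=(128, 128, 3)),
        layers.Conv2D(128 // divisor, 3, padding="same", activation="relu"),
        layers.Conv2D(256 // divisor, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=4),
        layers.BatchNormalization(),
        layers.Conv2D(128 // divisor, 3, padding="same", activation="relu"),
        layers.Conv2D(256 // divisor, 3, padding="same", activation="relu"),
        layers.Conv2D(512 // divisor, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=4),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),
    ])
```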

Table 8 The proposed model performance metrics of the three classes dataset individually
Table 9 The overall performance of the proposed model of the three classes dataset
Fig. 27 A Proposed model loss curve of the three classes, B proposed model accuracy curve of the three classes

Fig. 28 Proposed model confusion matrix of the three classes

According to Figs. 27 and 28, as well as the performance metrics in Tables 8 and 9, the proposed model can be considered size-invariant because of the high accuracy it achieved in this experiment. Furthermore, in order to ensure that the proposed model is illumination-invariant, the article examined the proposed model’s performance on a binary dataset containing 700 images with clouds and 700 clear images (without clouds), mixed equally from the other nine classes used in the NWPU-RESISC45 dataset; these 1400 images are then split into a 70% training set, a 15% validation set, and a 15% test set. Brightness augmentation with a brightness range of [0.2–2] was used during dataset preparation to strengthen the model’s ability to handle shifting lighting conditions (a sketch follows), adding to the model’s robustness in varied illumination situations. In this experiment, to avoid overfitting, the filter count of every convolutional layer of the proposed model is again divided by two. Figure 29 shows the accuracy and loss curves of the experiment, while Fig. 30 shows the confusion matrix of the two classes. Finally, Table 10 shows the performance metrics of the proposed model in this experiment.
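A sketch of the brightness augmentation, assuming the Keras brightness_range parameter, which samples a scaling factor from the given interval:

```python
# Illumination augmentation sketch: brightness factors drawn from [0.2, 2.0].
from tensorflow.keras.preprocessing.image import ImageDataGenerator

illumination_augmenter = ImageDataGenerator(brightness_range=[0.2, 2.0])
```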

Fig. 29 A Proposed model loss curve of the two-class dataset, B proposed model accuracy curve of the two-class dataset

Fig. 30 Proposed model confusion matrix of the two-class dataset

Table 10 The overall performance of the proposed model on the two-class dataset

According to Figs. 29 and 30, as well as the performance metrics in Table 10, the proposed model can be considered illumination-invariant because of the high accuracy it achieved in this experiment.

Finally, to confirm that the proposed model improves remote sensing image classification performance, the article compares the proposed method with recently published research articles in the field, as represented in Table 11.

Table 11 Comparison between the proposed method and related work for remote sensing images classification accuracy

Aspects Contributed to Accuracy Improvement

To enhance the performance of the proposed model, the article tuned several factors, including the number of filters, the layering structure, the kernel size, and the optimization technique. The unique and optimized layering structure of the suggested model, including the choice of the number of filters, the kernel size, and the overall hierarchical representation, enhanced feature extraction, which boosts the system’s overall performance. Furthermore, a learning-rate reduction technique was applied to improve convergence and model generalization. In addition, the proposed model used data augmentation, which increased its exposure to data variations and thus its robustness. Finally, Adam was chosen as the optimization algorithm, which sped up convergence and improved the use of gradient information.

Challenges

In this section, the main potential limitations and challenges that the proposed model may encounter have been highlighted:

  1. 1.

Limited dataset variation: one limitation the proposed model faces is the lack of a dataset that contains variations in seasonal conditions, lighting, and environment.

  2. 2.

Errors in labeling and annotation: the accuracy of the proposed model depends largely on the quality of the labels applied to the training data. Inherent flaws in labeling or annotation could affect the performance of the model as a whole.

  3. 3.

    The ability to transfer to other domains: Although our model is designed for object classification in satellite pictures, it may need additional tuning or adaptations in order to be used in other domains or datasets.

Areas of Improvement

  1. 1.

    Enhance data augmentation: utilize more diverse and complex techniques of data augmentation to make the model more resistant to variations in satellite photographs, such as changes in illumination, weather, and seasonal circumstances.

  2. 2.

    Preprocessing: Before incorporating satellite photographs into models, create preprocessing methods that improve noise reduction, feature extraction, and overall data quality.

  3. 3.

    Object localization: Increase the model's capacity to precisely localize and specify object boundaries inside the satellite pictures in addition to classifying objects.

  4. 4.

    Ensemble Models: Explore ensemble learning techniques by mixing different models or model variants to improve the overall classification accuracy.

Conclusion

This article introduces a robust CNN model structure for object detection in remote sensing images. The article focused on building an optimized CNN model with a novel structure and suitable hyper-parameters, able to classify remote sensing images with performance exceeding that of the pre-trained models at the smallest possible training time. The article evaluates the effectiveness of the presented models on both the NWPU-RESISC45 and UC Merced Land Use datasets. All findings indicate that the proposed pyramidal CNN model structure achieves the highest detection accuracy with a very small training time in comparison with the well-known pre-trained CNN models, and it can be utilized efficiently for object detection in remote sensing images, with an accuracy reaching 97.1%. The proposed model performed well across object classes of different sizes and across datasets with different illumination. Furthermore, the pyramidal model showed strong performance on small datasets, unlike the traditional pre-trained models. Future work may include utilizing more optimization algorithms, deep learning models, and classes, as well as further improving the proposed CNN structure.