
1 Introduction

With the enhancement of the resolution of remote sensing satellite images, more information can be obtained to support impactful research on land planning, disaster prevention and related applications. Classification is one of the most important tasks in high spatial resolution remote sensing (HSRRS) image processing. HSRRS images contain rich spatial, shape, texture and color features, which provide a good basis for classification [1].

Deep learning (DL)-based image classification methods usually adopt convolutional neural networks (CNNs) to extract image features automatically, which avoids complex hand-crafted feature design [2,3,4]. Generally, the deeper the neural network, the better the classification performance [5]. Owing to their excellent feature extraction capability, CNNs have been widely applied to tasks such as signal modulation recognition [6, 7] and voice signal processing. In recent years, CNNs have developed rapidly, and more and more CNN-based image classification models, such as VGG [8], ResNet [9] and InceptionV3 [10], have achieved excellent results. However, these classification models mainly rely on a large quantity of training samples to obtain high accuracy; otherwise overfitting may occur [11], which means that the trained model performs perfectly on the training samples but poorly on independent test samples. Because HSRRS images are difficult to acquire, HSRRS datasets are small. In recent years, a great deal of research has shown that transfer learning can greatly alleviate the overfitting of classification models trained with small sample sizes [12]. Y. Boualleg et al. proposed CNN-DeepForest, which combines deep forest and CNN transfer learning, for HSRRS image classification [13]. Xue et al. proposed MSDFF, a model based on multi-structure deep feature fusion, for HSRRS image classification [14]. However, there is still a gap in finding the optimum sample size at which transfer learning achieves the best classification performance.

In this paper, we propose a TL-DRN model for HSRRS image classification. The proposed model is trained on ten groups of datasets with different sample sizes to explore the influence of sample size on model training. The mean accuracy is used to evaluate the performance of the model. The main contributions of this paper can be summarized as follows:

  1.

    TL-DRN for HSRRS image classification with a limited sample size is proposed. Experiments prove that the TL-DRN model is more suitable than the DRN model for HSRRS image classification with small sample sizes.

  2.

    The impact of sample size on TL-DRN is studied. Experimental results confirm that the performance of TL-DRN tends to be stable when the training sample size reaches six per category.

2 Theoretical Basis

In this section, CNNs, ResNet50 and transfer learning are introduced in detail.

2.1 CNNs

A CNN is a special type of artificial neural network whose main feature is the ability to perform convolution operations. As a result, CNNs perform excellently in image classification, detection and segmentation [15].

The input of a CNN is often raw data such as images or audio. The structure of a CNN is a hierarchical model composed of convolutional layers, pooling layers, fully connected layers [16] and activation functions. The original input undergoes layer-by-layer operations to extract feature information, which is then used for classification through the fully connected layers [17].

When designing a CNN, the number of channels of each convolution kernel should equal the number of channels of the input data, and the number of convolution kernels should equal the number of channels output by the layer. The convolution operation generally has two attributes: stride (s) and padding (p). The output size of the feature map after convolution can be obtained by the following formula [18]:

$$\begin{aligned} n = \frac{{N + 2p - f}}{s}+1 \end{aligned}$$
(1)

where N, f and n represent the size of the input, the convolution kernel and the output respectively.
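As a quick numeric check of Eq. (1), the following Python sketch computes the output size of a convolution; the example values (a \(224\times 224\) input with a \(7\times 7\) kernel, padding 3 and stride 2) are illustrative assumptions rather than settings taken from this paper.

```python
def conv_output_size(N: int, f: int, p: int, s: int) -> int:
    """Eq. (1): n = (N + 2p - f) / s + 1."""
    return (N + 2 * p - f) // s + 1

# Illustrative example: 224x224 input, 7x7 kernel, padding 3, stride 2
print(conv_output_size(N=224, f=7, p=3, s=2))  # -> 112
```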

When the image passes through the convolutional part, the low-level convolutional layers extract low-level semantic features such as texture and shape, while the high-level convolutional layers extract high-level semantic features [19]. In general, high-level semantic features are more useful for image classification. Finally, the feature information output by the convolutional layers is mapped to the labeled sample space through the fully connected layers to complete the classification task.

2.2 ResNet50 and DRN

As the number of convolutional layers increases, higher-level semantic features of the image can be extracted. However, a deep network may suffer from vanishing or exploding gradients, which hinders convergence, and from the degradation problem, where accuracy saturates and then drops as the depth grows. To address these issues, ResNet50 was proposed. The ResNet50 network is a stack of residual blocks, whose structure is shown in Fig. 1. The principle of the residual block is as follows.

The residual network is built from residual units. First, a residual unit can be written as

$$\begin{aligned} {y_l} = h({x_l}) + F({x_l},{W_l}) \end{aligned}$$
(2)
$$\begin{aligned} {x_{l + 1}} = f({y_l}) \end{aligned}$$
(3)

where \({x_l}\) and \(x_{l+1}\) represent the input and output of the l-th residual unit, respectively [20], and each residual unit generally contains a multi-layer structure. F is the residual function, representing the learned residual. In addition, \(h({x_l}) = {x_l}\) denotes the identity mapping, and f denotes the rectified linear unit (ReLU) activation function expressed in Eq. (4). Adding a shortcut connection between the input and output of the network makes it easier to alleviate gradient vanishing and network performance degradation.

$$\begin{aligned} f(x) = \max (x,0) \end{aligned}$$
(4)
Fig. 1. The architecture of the residual block.
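To make Eqs. (2)-(4) concrete, the following is a minimal PyTorch sketch of an identity-mapping residual unit; the two-layer residual branch and the single channel width are illustrative assumptions rather than the exact bottleneck design used inside ResNet50.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """y_l = x_l + F(x_l, W_l);  x_{l+1} = ReLU(y_l)."""
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(            # F(x_l, W_l): the learned residual
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        y = x + self.residual(x)                  # Eq. (2) with the identity mapping h(x_l) = x_l
        return torch.relu(y)                      # Eqs. (3) and (4): x_{l+1} = f(y_l), f = ReLU
```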

It should be noted that, for brevity, the convolutional part of the ResNet50 network is referred to as the deep residual network (DRN) in the rest of this paper.

2.3 Transfer Learning

Transfer learning is a learning method for small sample training [21]. It applies the knowledge and experience learned in other tasks to the current task.

In transfer learning, domains (D) and tasks (T) are defined. Domains are divided into the source domain (\({D_s}\)) and the target domain (\({D_t}\)), and tasks are divided into the source task (\({T_s}\)) and the target task (\({T_t}\)). A domain consists of a feature space and a marginal probability distribution. Given \({D_s}\), \({T_s}\), \({D_t}\), and \({T_t}\), transfer learning uses the knowledge learned from \({D_s}\) and \({T_s}\) to enhance the learning of the prediction function f for \({D_t}\), where \(T=f(D)\), \({T_s}\ne {T_t}\) and \({D_s}\ne {D_t}\) [22].

For image classification, some studies have found that no matter which image dataset is fed into a CNN, the features extracted by the low-level convolutional layers are similar. A great deal of research has shown that the features a CNN learns on one dataset are often applicable to another dataset [23]. Therefore, based on this property, we conduct transfer learning on small samples.

3 The Proposed HSRRS Classification Based on DRN and Transfer Learning

In this section, the structure of the TL-DRN network, the training method of TL-DRN, the loss function, and the evaluation index used in the experiments are introduced in detail.

3.1 The Structure of TL-DRN

TL-DRN is a deep learning network model composed of DRN and transfer learning. First, we build the convolutional part of TL-DRN according to the structure of DRN to prepare for the feature transfer. After that, two fully connected layers are added behind the DRN for classification. The framework of the model is presented in Fig. 2.

Fig. 2. The framework of ResNet50 and TL-DRN.
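A minimal sketch of this structure is given below, assuming PyTorch and torchvision; the hidden width of the first new fully connected layer is an illustrative choice that the paper does not specify.

```python
import torch.nn as nn
from torchvision import models

class TLDRN(nn.Module):
    def __init__(self, num_classes: int = 10, hidden: int = 512):
        super().__init__()
        resnet = models.resnet50(weights=None)                    # same convolutional structure as ResNet50
        self.drn = nn.Sequential(*list(resnet.children())[:-1])   # DRN: ResNet50 without its fully connected layer
        self.newfc = nn.Sequential(                               # two new fully connected layers for classification
            nn.Flatten(),
            nn.Linear(2048, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.newfc(self.drn(x))
```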

3.2 The Method of TL-DRN Training

The TL-DRN training method consists of three parts: ResNet50 training, network reconstruction and feature transfer, and TL-DRN training. The process is presented in Algorithm 1.

ResNet50 Training. ResNet50 is composed of convolutional layers and residual blocks, which alleviates the gradient problems caused by increasing network depth. Figure 2 shows the model structure. The ImageNet [24] dataset is used as the source domain to train ResNet50, and it can be expressed as:

$$\begin{aligned} D_{S}=\left\{ \left( x_{1}, y_{1}\right) ,\left( x_{2}, y_{2}\right) , \ldots ,\left( x_{n}, y_{n}\right) \right\} \end{aligned}$$
(5)

where \(\boldsymbol{x}_{i}=[x_{i}^{(1)}, x_{i}^{(2)}, \ldots , x_{i}^{(k)}]^{T}\) \((i=1,2, \ldots , n)\), \(x_{i}^{(j)}\) represents the j-th feature of the i-th input sample of the source domain, and \(y_{i}\) represents the true label of that sample. The model can be expressed as:

$$\begin{aligned} F_{S}=f_{ResNet50}(\theta _{DRN},\theta _{fc}; x_{i}) \end{aligned}$$
(6)

where \(F_{S}\) represents the output of the model. \(\theta _{DRN}\) indicates the weight parameter obtained by training the deep residual convolution part and \(\theta _{fc}\) indicates the weight parameter obtained from the fully connected layer training.

Algorithm 1. The TL-DRN training process.

Network Reconstruction and Feature Transfer. TL-DRN is constructed according to Fig. 2. Since ResNet50 and TL-DRN share the same DRN structure, the weights of the convolutional part obtained by training ResNet50 can be extracted and loaded into the convolutional part of TL-DRN. The dataset used for TL-DRN consists of ten categories of HSRRS images. In transfer learning, this dataset is the target domain, so it can be written as:

$$\begin{aligned} D_{T}=\left\{ \left( x_{1}, y_{1}\right) ,\left( x_{2}, y_{2}\right) , \ldots ,\left( x_{n}, y_{n}\right) \right\} \end{aligned}$$
(7)

where \(\boldsymbol{x}_{i}=[x_{i}^{(1)}, x_{i}^{(2)}, \ldots , x_{i}^{(k)}]^{T}\), \(x_{i}^{(j)}\) represents the j-th feature of the i-th input sample of the target domain, and \(y_{i}\) represents the true label of that sample. The model can be expressed as:

$$\begin{aligned} F=f_{T L - D R N}(\theta _{DRN},\theta _{newfc}; x_{i}) \end{aligned}$$
(8)

where the value of \(\theta _{DRN}\) in Eq. (8) is the value of \(\theta _{DRN}\) trained in Eq. (6), and \(\theta _{newfc}\) denotes the weight parameters of the fully connected layers of TL-DRN.
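Continuing from the TLDRN sketch above, the feature-transfer step can be written as follows; here the ImageNet-pretrained weights shipped with torchvision stand in for the ResNet50 trained on the source domain, which is an assumption made for illustration.

```python
import torch.nn as nn
from torchvision import models

# Source model: ResNet50 trained on the source domain (ImageNet weights as a stand-in)
source = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

model = TLDRN(num_classes=10)
# Copy theta_DRN: the DRN part of TLDRN mirrors ResNet50's layers minus the final
# fully connected layer, so the two state dicts line up module for module.
drn_weights = nn.Sequential(*list(source.children())[:-1]).state_dict()
model.drn.load_state_dict(drn_weights)
```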

TL-DRN Training. After network reconstruction and feature transfer, the TL-DRN model already carries the knowledge \(\theta _{DRN}\) obtained by ResNet50 training. Therefore, we only need to train the parameters \(\theta _{newfc}\) of the fully connected layers, as sketched below.
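The sketch below shows this step under the same assumptions as above; the optimizer, learning rate and epoch count are illustrative, and `train_loader` is a placeholder for the small HSRRS training set.

```python
import torch

# Freeze theta_DRN so that only the new fully connected layers (theta_newfc) are trained
for p in model.drn.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.newfc.parameters(), lr=1e-3)  # illustrative settings
criterion = torch.nn.CrossEntropyLoss()

model.train()
model.drn.eval()                            # keep the frozen part's batch-norm statistics fixed
for epoch in range(50):                     # number of epochs is illustrative
    for images, labels in train_loader:     # placeholder loader over the HSRRS training samples
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```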

3.3 Loss Function

The categorical cross-entropy loss function is used as the loss function, and it can be written as:

$$\begin{aligned} L=-\sum _{i=1}^{N} y_{i} \log \left( F_{i}\right) \end{aligned}$$
(9)

where \(y_{i}\) is the i-th component of the one-hot true label and \(F_{i}\) is the i-th output of the model. N is the number of categories. We aim to train the model to find a set of parameters \(\theta \) that minimizes the loss function.
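As a small numeric check of Eq. (9), the sketch below evaluates the loss for a single sample; the one-hot label and softmax outputs are illustrative values, not results from the paper.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot label y, N = 4 categories
F      = np.array([0.05, 0.10, 0.80, 0.05])   # model outputs F after softmax
loss = -np.sum(y_true * np.log(F))            # Eq. (9): only the true class contributes
print(loss)                                   # -log(0.80) ~ 0.223
```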

3.4 Evaluation Index of Experimental Results

First, a line chart is used to show the training process, where the horizontal axis represents the training epoch, the left vertical axis indicates the mean accuracy, and the right vertical axis indicates the loss value.

Second, the mean accuracy (MA) judges the overall classification performance, and its calculation formula can be written as [25]:

$$\begin{aligned} \mathrm {M} \mathrm {A}=\frac{1}{N} \sum _{i=1}^{n} C_{i i} \end{aligned}$$
(10)

where N is the total number of samples in the test set, n is the total number of categories to be classified, and \(C_{i i}\) is the number of samples correctly classified for class i.
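A minimal sketch of Eq. (10) is given below, computing MA from a confusion matrix; the matrix values are illustrative.

```python
import numpy as np

def mean_accuracy(confusion: np.ndarray) -> float:
    """Eq. (10): MA = (1/N) * sum_i C_ii, with N the total test-set size."""
    N = confusion.sum()                  # total number of test samples
    return np.trace(confusion) / N       # correctly classified samples over N

# Illustrative 3-class confusion matrix (rows: true class, columns: predicted class)
C = np.array([[95, 3, 2],
              [4, 90, 6],
              [1, 5, 94]])
print(mean_accuracy(C))                  # -> 0.93
```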

4 Experimental Results

In this section, we collect different datasets, conduct experiments according to Section 3, and analyze the experimental results.

Fig. 3. Accuracy and loss curves during training and testing with TL-DRN when the training sample size is (a) one and (b) ten per category.

4.1 Data Description

From the UC Merced Land Use [26] and RSI-CB128 [27] datasets, ten categories of HSRRS image samples are selected for classification. One image per category is randomly selected to form a training dataset; afterwards, \(\{2, 3, 4, \cdots \!, 10\}\) images per category are selected in the same way. These ten sample sets are used as small training samples to explore the influence of sample size on TL-DRN classification. In addition, ten images per category are selected randomly to form a training dataset, and then \(\{20, 30, \cdots \!, 100\}\) images per category are selected in the same way. As a control group, these ten sample sets with larger training sample sizes are trained directly with ResNet50. The specific training sample sizes are shown in Table 1 and Table 2. In the experiments, all test sets are the same, with one hundred samples per category. It should be noted that the following operations are performed in the image preprocessing stage: (1) the input images are uniformly resized to \(224\times 224\times 3\); (2) the data are augmented by rotating and translating the images, so the dataset is expanded to twice its original size.
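The preprocessing described above can be sketched with torchvision transforms as follows; the rotation and translation ranges are illustrative assumptions, since the paper does not specify them.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                               # unify the input size to 224x224x3
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),   # rotation and translation augmentation
    transforms.ToTensor(),
])
```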

4.2 Experiment Settings

There are three experiments in this part. Experiment one explores the influence of sample size on the transfer learning model, while experiments two and three serve as control groups. First, for experiment one, ten sets of training data with small sample sizes are selected; the sample sizes of these ten training sets are \(\{1,2, 3,\cdots \!, 10\}\) per category. The TL-DRN model is built according to Fig. 2, and the ten sets of data are trained according to the TL-DRN training method in Section 3. Second, for experiment two, the same datasets as in the TL-DRN experiment are used; the ResNet50 model is built according to Fig. 2 and trained on the ten training sets separately. Finally, for experiment three, ten sets of training data with larger sample sizes are selected; the sample sizes of these ten training sets are \(\{10,20, 30, \cdots \!, 100\}\) per category, and ResNet50 is trained on them directly.

Fig. 4. Accuracy and loss curves during training and testing with ResNet50 when the training sample size is (a) one, (b) ten and (c) one hundred per category.

Table 1. Testing set mean accuracy on TL-DRN and ResNet50.
Table 2. Testing set mean accuracy on ResNet50.

4.3 Experimental Results

First, the line graphs of the training process of the three experiments with the smallest and the largest sample sizes are shown in Figs. 3 and 4. Each line graph shows how the accuracy and loss of the training set and the testing set change as the number of training iterations increases. Comparing the training process with the smallest sample size and the one with the largest sample size for each experiment, we can see that as the sample size increases, the overfitting of the model is better mitigated. For example, the MA of the training set and the testing set in Fig. 3(a) differ by about 20%, while those in Fig. 3(b) differ by about 6%. Moreover, for small samples, transfer learning can reduce overfitting, which can be seen by comparing Fig. 3(b) with Fig. 4(b).

For the first experiment, the test MA for each sample size is given in Table 1. It can be seen clearly that as the sample size increases, the classification performance of the proposed TL-DRN also improves greatly. However, once the sample size reaches six per category, the test classification performance fluctuates only slightly, by about 0.3%, and the MA on the testing set stays around 94.4%.

For the second experiment, comparing the test results of TL-DRN and ResNet50 at the same sample sizes in Table 1, it can be found that transfer learning greatly improves classification performance with small sample sizes. At most, TL-DRN improves accuracy by nearly 39% compared with ResNet50, and at the least it brings a nearly 12% improvement.

For the third experiment, the test MA for each sample size is given in Table 2. Compared with the first experiment, when trained with ten times the sample size of the first experiment, ResNet50 generally performs slightly better than TL-DRN. At most, ResNet50 improves accuracy by nearly 7.6% compared with TL-DRN, while in the worst case it is nearly 2.7% lower.

5 Conclusion

In this paper, the influence of the sample size on the classification of ten categories of HSRRS images with the TL-DRN model was investigated. When the training sample size reaches six per category, the classification performance of the TL-DRN network tends to be stable. In addition, the classification performance of the TL-DRN model is far better than that of ResNet50 on training samples of the same size, and when the training sample size of ResNet50 is ten times that of TL-DRN, the performance of TL-DRN is only slightly lower than that of ResNet50. These results show that TL-DRN is a good candidate for HSRRS image classification. However, once the sample size increases to a certain level, further increases have little effect on performance. In future work, we will continue to improve the classification performance of the model with small samples through other methods.