1 Introduction

Hyperspectral images (HSIs) combine imagery and spectra in a single data cube and offer rich spatial information, a wide range of spectral bands and high spectral resolution. These properties strengthen the ability of remote sensing (RS) to observe the ground and identify ground objects, and HSIs have therefore been widely used in many fields, such as military reconnaissance [1], environmental monitoring, precision agriculture and medical diagnosis [2]. Hyperspectral image classification (HSIC) is one of the basic problems in RS image processing and is the foundation and key of RS image analysis and interpretation. Its main goal is to recognize the actual ground objects in an image, i.e., to assign a unique category to each pixel. Early HSIC relied on traditional machine learning algorithms such as support vector machines (SVM) [3], random forests (RF) and logistic regression (LR), which mostly focus on spectral information alone. However, the same ground object can show spectral differences at different locations, and different ground objects may share similar spectral characteristics. Because such methods ignore the rich spatial structural features, their classification maps often contain a large amount of noise and they struggle to classify complex ground objects accurately [4]; integrating spectral and spatial information is therefore an effective way to improve HSIC results. Since spatially neighboring pixels usually carry correlated information, methods such as Markov random fields [5] and morphological attribute profiles have been used to exploit both the spectral and spatial information of HSIs with good results. However, these hand-crafted spatial–spectral features rely heavily on expert knowledge, and the shallow features they extract offer only limited gains in classification accuracy.

Over the past few years, deep learning (DL)-based image classification techniques have become increasingly popular in HSIC [6]. These methods automatically extract abstract features, from low-level to high-level semantics, with better representation ability, making the subsequent classification more accurate. The convolutional neural network (CNN) is a prime example of a DL model with superior feature-extraction and classification performance. For pixel-level HSIC, 1D CNNs can only exploit the spectral characteristics of the image [7]. To make fuller use of the spectral and spatial features of HSIs, scholars proposed 2D and 3D CNN models in succession. Among them, Zhao et al. [8] used a two-dimensional CNN (2DCNN) for HSIC, taking into account the important role of the spatial information of HSI. However, extracting spectral and spatial features separately does not fully exploit their combination and requires complex pre-processing. Since an HSI has a 3D cubic structure, Li et al. [9] applied 3D convolution directly to obtain the spatial–spectral features of HSI and improved HSIC accuracy, but a 3DCNN is more computationally intensive than a 2DCNN and has higher memory requirements. Zheng et al. [10] sought to simplify the model while maintaining a high level of classification accuracy: they combined the advantages of 3DCNN and 2DCNN in a mixed convolutions and covariance pooling model (MCNN-CP) and verified the potential of hybrid convolution in HSIC. Fırat et al. [11] introduced depthwise separable convolution into a 2D–3D hybrid convolution model, which effectively improves the accuracy of HSIC. However, the above models tend to ignore that different features contribute unequally to the classification result. Considering this difference in contribution [12], Shi et al. [13] proposed a 3D coordination attention mechanism network (3DCAMNet), which combines 3D convolution with an attention mechanism to weigh the importance of different spectral bands and thereby improves model performance. However, as the network structure deepens, performance gradually decreases and the degradation problem easily occurs.

Residual connections in Residual Networks (ResNet) [14] allow deeper networks and better-optimized model structures. Qing et al. [15] introduced residual connections into a 2DCNN to improve HSIC accuracy. He et al. [16] combined a 3DCNN with residual connections to construct an HSIC model, but because it does not fully exploit the spatial–spectral information of HSI, its classification performance still leaves some room for improvement. To address this, Cao et al. [17] built a comprehensive hybrid convolution residual network (BHModel) to improve HSI feature learning, which uses 2D–3D convolutional mixing to drastically reduce the number of parameters and thus simplify the network architecture; however, its single feature-fusion path and underuse of shallow features limit the achievable gain in classification performance. Dang et al. [18] built a lightweight model (JPModel) for HSIC that combines residual connections with depthwise separable convolution, reducing the network parameters, adding paths for learning features and further improving classification accuracy. He et al. [19] used a multi-scale residual network (SSMRN) to obtain the spectral–spatial information of HSI and effectively learn the target features. Lei et al. [20] introduced capsule residual blocks to increase the depth of the network; although more critical information was extracted, the complexity of the proposed MS-CapsNetW is relatively high.

The residual networks described above can cope with the degradation phenomenon, but they convolve the data directly, which slows training and leaves the extracted features underused. Pre-activated residual connections reduce model complexity and accelerate convergence, so introducing the pre-activation mechanism [21, 22] into the residual network improves the structure of the original residual module: it not only speeds up training but also yields deeper, more representative features in the joint learning of spatial–spectral features. To address insufficient feature utilization in deep network models, combining long- and short-distance residual connections [23, 24] alleviates gradient vanishing on the one hand and, on the other hand, compensates for the feature information lost in convolution by linking the bottom of the network to the top, which keeps the training of deep networks stable. This strategy therefore partially recovers the missing information and strengthens the feature-fusion capability.

Inspired by the above works, this study proposes PMCRNet, an HSIC model based on 2D–3D hybrid convolution and a pre-activated residual network. PMCRNet effectively compensates for incomplete feature extraction and improves computational efficiency. The major contributions of this study are as follows.

  1.

    By combining hybrid convolution with a pre-activated residual module, we obtain deep joint spatial–spectral features of the HSI; this accelerates model convergence and reduces parameter computation while enhancing the extraction of high-level spatial-semantic features.

  2.

    We adopt long- and short-distance residual connections to effectively fuse shallow and deep information and obtain high-level semantic features. This further increases the expressiveness of the features and alleviates the vanishing-gradient problem as the number of layers in a deep network increases.

  3.

    We experimentally validate the proposed model and compare it with seven other state-of-the-art related models on three publicly available HSI datasets. The experimental results show that the proposed PMCRNet outperforms the other HSIC methods.

2 Residual neural network

In classification tasks, shallow networks have limited feature-extraction ability, so more complex and richer features must be learned by building deeper networks. However, as the network deepens and approaches saturation, model performance drops, so simply increasing the number of layers does not necessarily improve classification accuracy. He et al. [14] proposed ResNet to effectively avoid this degradation problem. The core of ResNet is the Residual Building Block (RBB), which mainly consists of convolutional layers, batch normalization layers and the ReLU activation function.

2.1 Convolutional layer

The convolutional layer, composed of multiple convolutional units, extracts feature information from the input; the parameters of each convolutional unit are optimized by the back-propagation algorithm. Features are obtained by sliding receptive fields of different sizes over the input image and performing convolution at each position. The computation of the convolutional layer is defined as

$$ x_{j}^{l} = f\left( {\sum\limits_{{i \in M_{j} }} {x_{i}^{l - 1} * k_{ij}^{l} + b_{j}^{l} } } \right) $$
(1)

where \(x_{j}^{l}\) represents the jth feature map of layer l, \(f(\cdot)\) is the activation function, \(M_{j}\) denotes the set of input feature maps, \(x_{i}^{l-1}\) represents the ith feature map of layer l-1, \(k_{ij}^{l}\) denotes the convolution kernel connecting the ith feature map of layer l-1 to the jth feature map of layer l, * denotes the convolution operation, and \(b_{j}^{l}\) denotes the bias term of the jth convolution kernel of layer l.
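As a concrete illustration of Eq. (1), the following minimal NumPy sketch computes one convolutional layer on a list of 2D feature maps; the valid (no-padding) convolution, the kernel flip (deep learning frameworks usually implement cross-correlation instead) and the ReLU default for \(f(\cdot)\) are assumptions made only for this example.

```python
import numpy as np

def conv_layer(x_prev, kernels, biases, f=lambda z: np.maximum(z, 0)):
    """x_prev: list of input maps x_i^{l-1}; kernels[i][j]: kernel k_ij^l; biases[j]: bias b_j^l."""
    n_in, n_out = len(kernels), len(kernels[0])
    outputs = []
    for j in range(n_out):
        acc = None
        for i in range(n_in):
            k = np.flip(kernels[i][j])              # flip for true convolution (frameworks often cross-correlate)
            kh, kw = k.shape
            xh, xw = x_prev[i].shape
            conv = np.zeros((xh - kh + 1, xw - kw + 1))
            for u in range(conv.shape[0]):          # slide the receptive field over the input map
                for v in range(conv.shape[1]):
                    conv[u, v] = np.sum(x_prev[i][u:u + kh, v:v + kw] * k)
            acc = conv if acc is None else acc + conv
        outputs.append(f(acc + biases[j]))          # x_j^l = f(sum_i x_i^{l-1} * k_ij^l + b_j^l)
    return outputs
```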

2.2 Batch normalization

For a network, Batch Normalization (BN) speeds up convergence and improves generalization. By normalizing each batch of data with its mean and variance, the output of every layer is kept within an effective range before being passed to the next layer. The process is computed as

$$ w = \frac{v - c(v)}{{\sqrt {d(v) + \varepsilon } }} $$
(2)

where \(w\) is the normalized activation value, \(v\) is the activation value of a given layer of the network, \(c(v)\) and \(d(v)\) represent the mean and the variance, respectively, and \(\varepsilon\) is a small constant close to zero.
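The following short NumPy sketch renders Eq. (2) for one mini-batch with the batch dimension along axis 0; the learnable scale and shift used by standard BN layers are omitted because the equation above does not include them, and the value of \(\varepsilon\) is an assumption.

```python
import numpy as np

def batch_norm(v, eps=1e-5):
    """Normalize a mini-batch of activations v (batch along axis 0), as in Eq. (2)."""
    mean = v.mean(axis=0)                  # c(v)
    var = v.var(axis=0)                    # d(v)
    return (v - mean) / np.sqrt(var + eps)
```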

2.3 Activation function

The Rectified Linear Unit (ReLU) is the most commonly used activation function in neural networks; it helps alleviate the vanishing-gradient problem and speeds up network convergence. The function is calculated as follows:

$$ {\text{ReLU}}(z) = \max (0,z) $$
(3)

where \(z\) and \({\text{ReLU}}(z)\) are the input and output of the activation function, respectively.

2.4 Residual module

The RBB uses shortcut connections to skip blocks of convolutional layers for efficient information transfer, avoiding gradient explosion and vanishing; this helps construct deeper neural network structures and improves the overall performance of the network.

Figure 1a shows the original RBB structure, whose residual branch follows the path "Input x → Convolution layer → BN → ReLU → Convolution layer → BN → Output F(x)". Performing the convolution operation directly on the data increases the computational complexity of network training and slows training down. He et al. [24] improved the original residual module and proposed the Pre-activated Residual Building Block (PARBB), whose structure is shown in Fig. 1b; its residual branch follows the path "Input x → BN → ReLU → Convolutional layer → BN → ReLU → Convolutional layer → Output F(x)". The pre-activated residual approach places the BN layer and the activation function before the convolution layer, which expands the range of the identity mapping, makes information transmission smoother, keeps the network more stable, and helps improve the feature-learning capacity of the network. The computation of the residual module is expressed as follows:

$$ y = x + F(x) $$
(4)

where x denotes the input of the RBB and y denotes its desired output; the residual F(x) is the difference between the desired output and the input.
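The two residual units compared in Fig. 1 can be sketched in Keras as follows, directly mirroring the execution paths quoted above; the 2D convolutions, 3 × 3 kernels and "same" padding are assumptions chosen so that the shortcut addition of Eq. (4) is shape-compatible (the filter count must equal the channel count of the input).

```python
from tensorflow.keras import layers

def rbb(x, filters, kernel_size=3):
    """Original residual block: Conv -> BN -> ReLU -> Conv -> BN, plus shortcut (Fig. 1a)."""
    f = layers.Conv2D(filters, kernel_size, padding="same")(x)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, kernel_size, padding="same")(f)
    f = layers.BatchNormalization()(f)
    return layers.Add()([x, f])            # y = x + F(x), Eq. (4)

def parbb(x, filters, kernel_size=3):
    """Pre-activated residual block: BN -> ReLU -> Conv -> BN -> ReLU -> Conv, plus shortcut (Fig. 1b)."""
    f = layers.BatchNormalization()(x)
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, kernel_size, padding="same")(f)
    f = layers.BatchNormalization()(f)
    f = layers.Activation("relu")(f)
    f = layers.Conv2D(filters, kernel_size, padding="same")(f)
    return layers.Add()([x, f])
```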

Fig. 1 Different types of residual unit structures

3 Proposed method

3.1 Hybrid convolutional residual module

HSI is a 3D data cube whose rich and fine spatial and spectral features can be extracted with CNNs. When a 2DCNN is used as the HSIC model, the original HSI must be preprocessed, which reduces the spectral dimensional information. 3D convolution can take the HSI directly as the network input without complex preprocessing and can extract spatial and spectral features simultaneously. However, a 3DCNN has more parameters to learn and thus suffers from high computational complexity. Moreover, the HSI bands contain a considerable amount of redundant information and noise, so 3D convolution alone does not yield ideal HSIC results. To address these problems, we combine 2D convolution, 3D convolution and PARBBs to design a hybrid convolution residual module (HCRM), which effectively compensates for incomplete feature extraction and improves computational efficiency. The structure of the HCRM is shown in Fig. 2.

Fig. 2 Hybrid convolutional residual module (HCRM)

Conv3D and Conv2D denote 3D and 2D convolutional layers, respectively; BN denotes batch normalization and ReLU is the activation function. The first Conv3D performs joint spatial–spectral feature learning on the input. In branch a, to make the data dimensions satisfy the requirements of Conv2D, the feature map is concatenated along the spectral and channel dimensions and thus reshaped from a 4D tensor to a 3D tensor, which is fed into a PARBB for deep feature learning. In branch b, to enhance information transfer, the spatial–spectral features further extracted by the second Conv3D are reshaped and added element-wise to the features obtained from branch a; the fused features are then fed into a Conv2D to enhance the spatial information, yielding feature A. In branch c, to reduce complexity and speed up convergence, the features are first reshaped into a 3D tensor and then passed through another PARBB to obtain feature B. Finally, features A and B are fused by element-wise addition and reshaped into a 4D tensor as the output. The HCRM thus strengthens the learning of joint spatial–spectral features and of features at different levels of abstraction while ensuring effective information transfer, which in turn improves the HSIC results.
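The following hedged Keras sketch traces this data flow, reusing the parbb helper from the Sect. 2.4 sketch; the 3 × 3 kernels, the 32-filter width, the "same" padding and the exact branch wiring are assumptions read from the description above and may differ in detail from the authors' implementation.

```python
from tensorflow.keras import layers

def hcrm(x, filters=32):
    """Hybrid convolution residual module; x is a 5D tensor (batch, height, width, bands, channels)."""
    h, w, b = int(x.shape[1]), int(x.shape[2]), int(x.shape[3])
    x3d = layers.Conv3D(filters, 3, padding="same")(x)               # joint spatial-spectral features

    # branch a: merge the spectral and channel dimensions, then learn deep features with a PARBB
    a = layers.Reshape((h, w, b * filters))(x3d)
    a = parbb(a, b * filters)                                        # parbb from the Sect. 2.4 sketch

    # branch b: second Conv3D, reshaped and added to branch a, then Conv2D -> feature A
    fb = layers.Conv3D(filters, 3, padding="same")(x3d)
    fb = layers.Reshape((h, w, b * filters))(fb)
    feat_a = layers.Conv2D(b * filters, 3, padding="same")(layers.Add()([a, fb]))

    # branch c: reshape and pass through another PARBB -> feature B
    feat_b = parbb(layers.Reshape((h, w, b * filters))(x3d), b * filters)

    # fuse feature A and feature B, then restore the 4D (per-sample) tensor layout
    out = layers.Add()([feat_a, feat_b])
    return layers.Reshape((h, w, b, filters))(out)
```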

3.2 HSIC model based on 2D–3D hybrid convolution and pre-activated residual

Figure 3 shows the basic framework of the proposed PMCRNet. It mainly consists of five parts: feature extraction, HCRMs, feature fusion, full residual learning, and classification.

  1.

    Feature extraction. This part is composed of two Conv3D layers. The first consists of 32 convolutional kernels of size 3 × 3 and extracts shallow features, which helps retain more positional and detail information; the second adopts 32 convolutional kernels of size 3 × 3 to capture deeper information of the HSI. A single 3D convolution kernel applied to a 3D HSI data block yields a 3D tensor, whereas the feature maps extracted by multiple 3D convolution kernels form a 4D tensor, so the final output of this part is a 4D tensor.

  2.

    HCRMs. After the feature extraction part, higher-order features at different semantic levels are further extracted by three HCRMs. Each HCRM is a tandem structure consisting of two Conv3D and two Conv2D layers; following the PARBB design, the first Conv3D is skip-connected to the first Conv2D and the second Conv3D to the second Conv2D. This design mines the useful spatial–spectral information of the HSI at a deeper level while reducing the computational complexity.

  3.

    Feature fusion. To reduce the loss of HSI features and enhance the comprehensive representation of features at different levels, the outputs of all HCRMs are fed into a Concat layer for concatenation. Then, a Conv3D with 32 convolutional kernels of size 3 × 3 serves as a transition layer to make the number of output feature maps equal to the number of shallow feature maps, yielding fused features with stronger representation capability.

  4.

    Full residual learning. This part consists of pre-activated short-distance residual connections and a long-distance residual connection. The short-distance residual connections alleviate the training problems of the deep network, while the long-distance residual connection fuses the output of the feature extraction part with the output of the feature fusion part, effectively combining deep features with shallow information. This structure compensates for lost information to a certain extent and improves the fusion capability of PMCRNet, thus yielding better HSIC results.

  5.

    Classification. First, a Conv2D with 32 convolution kernels convolves the output feature map to improve computational efficiency. Then, a max-pooling operation retains the strongest responses in each feature region, which helps preserve the most influential information. Finally, the output is processed through the fully connected layer, and the Softmax function yields the HSIC result. The function is given by

    $$ {\text{soft}}\max (c_{h} ) = \frac{{e^{{c_{h} }} }}{{\sum\nolimits_{d = 1}^{D} {e^{{c_{d} }} } }} $$
    (5)

where \(c_{h}\) is the output of the hth node and D is the number of output nodes, i.e., the number of classes to be distinguished. In addition, we employ Dropout to prevent overfitting; it also reduces the dependence on local features and enhances the generalization ability of PMCRNet. A hedged end-to-end sketch of the pipeline described in this list is given below.
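As a summary of the five parts above, the following hedged Keras sketch assembles the pipeline end to end, reusing the hcrm helper from the Sect. 3.1 sketch; the patch size, band count, concatenation axis and kernel shapes are placeholder assumptions, while the 32 filters, the three HCRMs, the long residual connection, the Dropout rate of 0.5 (Sect. 4.4) and the Softmax output follow the text.

```python
from tensorflow.keras import layers, models

def build_pmcrnet(patch=15, bands=30, num_classes=16, filters=32):
    inp = layers.Input((patch, patch, bands, 1))

    # (1) feature extraction: two Conv3D layers with 32 kernels each
    x = layers.Conv3D(filters, 3, padding="same")(inp)
    shallow = layers.Conv3D(filters, 3, padding="same")(x)

    # (2) three HCRMs extracting higher-order features (hcrm from the Sect. 3.1 sketch)
    h1 = hcrm(shallow, filters)
    h2 = hcrm(h1, filters)
    h3 = hcrm(h2, filters)

    # (3) feature fusion: concatenate the HCRM outputs, then a transition Conv3D back to 32 maps
    fused = layers.Concatenate()([h1, h2, h3])
    fused = layers.Conv3D(filters, 3, padding="same")(fused)

    # (4) full residual learning: long-distance connection from the shallow features
    x = layers.Add()([shallow, fused])

    # (5) classification: Conv2D with 32 kernels, max pooling, Dropout and Softmax (Eq. (5))
    x = layers.Reshape((patch, patch, bands * filters))(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```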

Fig. 3 The structure of PMCRNet

4 Experiments and analysis

4.1 Dataset description

To validate the effectiveness of the proposed PMCRNet, experiments are conducted on three publicly available HSI datasets: Indian Pines (IP), Pavia University (PU) and Salinas (SA). The details are shown in Table 1.

Table 1 Description of three datasets

4.2 Experimental environment

The hardware environment for the experiments is an Intel Core i7-13700F processor, 32 GB of RAM and an RTX 3090 graphics card with 24 GB of memory. The software environment uses Keras as the main deep learning framework, with PyCharm as the IDE and Python 3.6 as the runtime environment.

4.3 Selection of evaluation indicators and experimental data

Overall Accuracy (OA), Average Accuracy (AA) and the Kappa coefficient are used as evaluation indicators. OA is the proportion of samples whose predicted category equals the actual category among the total number of samples and can be expressed as:

$$ {\text{OA}} = \frac{{\sum\nolimits_{i} {n_{ii} } }}{{\sum\nolimits_{i} {N_{i} } }} $$
(6)

where \(n_{ij}\) represents the number of samples of class i in the image that are incorrectly predicted as class j, \(n_{ii}\) represents the number of correctly classified samples in class i, and \(N_{i} = \sum\nolimits_{j} {n_{ij} }\) is the total number of samples of class i to be classified. OA assesses overall classification effectiveness well, but in multi-class problems with an unbalanced category distribution, the categories with more samples have a greater influence on the OA value.

AA averages, over all categories, the ratio of correctly predicted samples in a category to the total number of samples in that category, and thus reflects the classification performance of every category. It can be expressed as:

$$ {\text{AA}} = \frac{1}{k}\sum\nolimits_{i} {\frac{{n_{ii} }}{{N_{i} }}} $$
(7)

where k represents the number of categories of the sample to be classified.

The Kappa coefficient is a measure of consistency and can also be used to evaluate classification accuracy; here, consistency is the degree of agreement between the predicted results and the actual classification. It can be expressed as:

$$ {\text{Kappa}} = \frac{{\sum\nolimits_{i} {N_{i} } \cdot \sum\nolimits_{i} {n_{ii} } - \sum\nolimits_{i} {\left( {\sum\nolimits_{j} {n_{ij} } \cdot \sum\nolimits_{j} {n_{ji} } } \right)} }}{{\left( {\sum\nolimits_{i} {N_{i} } } \right)^{2} - \sum\nolimits_{i} {\left( {\sum\nolimits_{j} {n_{ij} } \cdot \sum\nolimits_{j} {n_{ji} } } \right)} }} $$
(8)

where a higher Kappa coefficient indicates a better classification effect of the model, i.e., samples are less likely to be missed or misclassified.
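The three indicators can be computed from a confusion matrix as in the following minimal NumPy sketch, where n[i, j] counts class-i samples predicted as class j; the Kappa expression used here is the algebraically equivalent (OA − Pe)/(1 − Pe) form of Eq. (8).

```python
import numpy as np

def oa_aa_kappa(n):
    """n: k x k confusion matrix with n[i, j] = class-i samples predicted as class j."""
    n = np.asarray(n, dtype=float)
    total = n.sum()
    oa = np.trace(n) / total                               # Eq. (6)
    aa = np.mean(np.diag(n) / n.sum(axis=1))               # Eq. (7)
    pe = np.sum(n.sum(axis=1) * n.sum(axis=0)) / total**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                         # Eq. (8), equivalent form
    return oa, aa, kappa
```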

The training set for IP consists of 5% of the samples, with the remaining samples used as the test set; for PU and SA the training proportion is 1%. To ensure the randomness of the samples, the samples of each class are shuffled and then drawn according to the given proportion. Each experiment is repeated 10 times with a newly randomized split, and the average is taken as the final classification accuracy.
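A minimal sketch of this per-class (stratified) random split, assuming the labelled pixels are stored in arrays X (samples) and y (class labels), is:

```python
from sklearn.model_selection import train_test_split

def split_dataset(X, y, train_ratio, seed):
    """Shuffle and draw train_ratio of the samples from every class (stratified split)."""
    return train_test_split(X, y, train_size=train_ratio, stratify=y, random_state=seed)

# e.g. 5% training samples for IP, repeated with 10 different seeds and averaged
# X_train, X_test, y_train, y_test = split_dataset(X, y, 0.05, seed=0)
```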

4.4 Experimental parameter setting

We use Adam to optimize the loss function. Adam is an extension of stochastic gradient descent that is memory-efficient and computationally efficient. We set the learning rate to 0.001, the batch size to 32, the number of epochs to 600, and the Dropout rate in the fully connected part to 0.5.
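A minimal Keras training setup with these hyper-parameters is sketched below; model, X_train and y_train are placeholders for the compiled PMCRNet and the split described in Sect. 4.3, and the categorical cross-entropy loss is an assumption, as the text does not name the loss function explicitly.

```python
from tensorflow.keras.optimizers import Adam

def compile_and_train(model, X_train, y_train):
    """Train with the hyper-parameters of Sect. 4.4: Adam, lr 0.001, batch size 32, 600 epochs."""
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",   # assumed loss; not stated in the text
                  metrics=["accuracy"])
    return model.fit(X_train, y_train, batch_size=32, epochs=600)
```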

Spatial neighborhood information has a significant impact on HSIC accuracy. Too small a spatial neighborhood fails to capture discriminative features of the key spectral bands, while too large a neighborhood may introduce noise. The effect of spatial neighborhood size on the HSIC results is shown in Fig. 4: accuracy first increases with the neighborhood size and then decreases once the size exceeds a certain range. For the IP, PU and SA datasets, the optimal spatial neighborhood sizes are 15 × 15, 19 × 19 and 17 × 17, respectively. These results indicate that when the ground objects in an HSI occupy larger areas, the image contains more information, so a relatively large spatial neighborhood is better at preserving the spatial information of each ground-object category.
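A hedged NumPy sketch of this neighborhood extraction is given below: each labelled pixel becomes an S × S × B cube centred on it, with zero padding at the image borders (the padding choice and the names are assumptions for illustration).

```python
import numpy as np

def extract_patches(cube, coords, size):
    """cube: (H, W, B) hyperspectral image; coords: (row, col) positions of labelled pixels."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="constant")
    return np.stack([padded[i:i + size, j:j + size, :] for i, j in coords])

# e.g. 15 x 15 neighborhoods for the IP dataset
# patches = extract_patches(hsi, labelled_coords, size=15)
```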

Fig. 4 Impact of spatial neighborhood blocks of different scales

4.5 Ablation experiment

To validate the effectiveness of the long-connection module and the pre-activated residual module, ablation experiments were performed on the three datasets. Using a 2D–3D CNN as the baseline, the long-connection module and the pre-activated residual module are introduced separately and jointly to examine the influence of each component on classification performance. The structures and experimental results of the four models are shown in Table 2, where model D is the proposed model.

Table 2 Comparison of the structure and ablation results of the four models

From Table 2, model D achieves the best performance on all three datasets, with OA above 99%. Compared with model A, model B improves OA on all three datasets, showing that the long-connection approach improves the classification performance. Model C also achieves a small improvement in classification accuracy on the three datasets. These results illustrate that introducing the pre-activated residual module and the long–short connections into the 2D–3D hybrid model improves the performance and classification ability of the model to a certain extent.

4.6 Experimental results and analysis

To verify the HSIC effect of PMCRNet, we conducted experimental comparisons between PMCRNet and 2DCNN [8], 3DCNN [9], MCNN-CP [10], BHModel [17], JPModel [18], SSMRN [19] and MS-CapsNetW [20]. To ensure fairness, all experiments were conducted under the same settings, and the network parameters of the compared methods are kept consistent with their original references.

Tables 3, 4 and 5 show the detailed classification results on the three datasets. The accuracy of 2DCNN is low, and there is a large gap between it and the other methods on all three datasets because of its insufficient learning ability. 3DCNN outperforms 2DCNN: on the three datasets its OA is 4.05%, 4.23% and 4.45% higher, its AA is 1.85%, 5.89% and 4.29% higher, and its Kappa coefficient is 4.63%, 5.62% and 4.95% higher, indicating that 3D convolution has a certain advantage in mining joint spatial–spectral features. Compared with the single 2DCNN and 3DCNN, the hybrid convolutional model MCNN-CP further improves the HSIC results: its OA on the three datasets is 2.86%, 1.79% and 1.43% higher than that of 3DCNN, its AA improves by 5.25%, 1.22% and 1.58%, and its Kappa coefficient improves by 3.24%, 2.28% and 1.59%, respectively, verifying the potential of hybrid models for mining HSI features. The OA, AA and Kappa coefficients of BHModel, JPModel, SSMRN, MS-CapsNetW and PMCRNet are all higher than those of MCNN-CP, proving the effectiveness of introducing the residual structure.

Table 3 Classification results (%) of different methods on IP dataset
Table 4 Classification results (%) of different methods on PU dataset
Table 5 Classification results (%) of different methods on SA dataset

Among the five residual models, BHModel has the lowest OA due to its simple structure and limited feature-extraction ability. On the three datasets, compared with BHModel, the OA of JPModel improves by 0.86%, 1.24% and 0.43%, its AA decreases by 0.41% on the first dataset but improves by 1.67% and 0.27% on the other two, and its Kappa coefficient improves by 0.99%, 1.67% and 0.48%, respectively. The reason is that JPModel combines residual connections with depthwise separable convolution and stacks multiple residual blocks to continuously extract the spatial context and spectral features of the data cube. Compared with JPModel, SSMRN achieves 0.99%, 0.8% and 0.76% higher OA, 2.01%, 1.22% and 1.11% higher AA and 1.12%, 1.05% and 0.85% higher Kappa coefficients on the three datasets, respectively, because its multi-scale residual structure captures spectral–spatial information and learns the features of the target ground objects effectively. Compared with SSMRN, MS-CapsNetW improves OA by 0.51%, 0.28% and 0.55%, AA by 0.45%, 0.44% and 0.27%, and the Kappa coefficient by 0.58%, 0.38% and 0.62% on the three datasets, respectively. These results indicate that the two-channel residual connection better increases the depth of the model and extracts more advanced and comprehensive features, which in turn improves HSIC accuracy. Among all of the residual models, PMCRNet has the highest OA, AA and Kappa coefficients, suggesting that introducing the pre-activated residual structure allows more informative features to be exploited effectively.

Figures 5, 6 and 7 show the classification maps obtained by the different methods on the three datasets. Compared with the other methods, the classification map of PMCRNet has the fewest misclassified ground objects, is smoother overall and contains only a very small number of noise points, making it closest to the ground-truth map.

Fig. 5 Classification results of different methods for IP. a False color image, b 2DCNN, c 3DCNN, d MCNN-CP, e BHModel, f ground truth, g JPModel, h SSMRN, i MS-CapsNetW, j PMCRNet

Fig. 6 Classification results of different methods for PU. a False color image, b 2DCNN, c 3DCNN, d MCNN-CP, e BHModel, f ground truth, g JPModel, h SSMRN, i MS-CapsNetW, j PMCRNet

Fig. 7 Classification results of different methods for SA. a False color image, b 2DCNN, c 3DCNN, d MCNN-CP, e BHModel, f ground truth, g JPModel, h SSMRN, i MS-CapsNetW, j PMCRNet

4.7 Runtime comparison

To better evaluate the HSIC performance of the models, the training time, testing time and number of parameters of the different methods on the three datasets were analyzed experimentally. The number of parameters is reported in millions (M = 10^6). The comparison results are shown in Table 6.

Table 6 Runtime (seconds) and number of parameters (M = 10^6) of different methods on the three datasets

From the training and test times in Table 6, PMCRNet takes longer to train than 2DCNN, 3DCNN, MCNN-CP and BHModel, mainly because it uses multiple hybrid convolutional blocks and adds long and short residual connections and a pre-activation mechanism to the network, which increases the training time. However, PMCRNet consumes less time than JPModel, SSMRN and MS-CapsNetW, mainly because it makes better use of its parameters and can better mine the rich spatial–spectral features in the HSI through feature reuse and enhanced information transfer. In terms of the number of parameters, 2DCNN, 3DCNN, MCNN-CP and BHModel train quickly owing to their simple structures and small parameter counts, but this also prevents them from fully extracting features, so their HSIC accuracy is poor. Compared with JPModel, SSMRN and MS-CapsNetW, the proposed method reduces the number of parameters and the computational complexity while achieving better classification results.

Although PMCRNet does not achieve the lowest time consumption, a comprehensive consideration of HSIC accuracy, the number of parameters and the other indicators shows that the model achieves a better classification effect and is well suited to a variety of practical engineering application scenarios.

4.8 Effectiveness of small sample sizes

To further demonstrate the effectiveness of the proposed method, 10, 20, 30, 40 and 50 samples per ground-object class were randomly selected as training data on the IP, PU and SA datasets. Figure 8 compares the OA of the different methods under these small-sample conditions. Our method still achieves the best classification results, which demonstrates its robustness when only a small number of training samples is available.

Fig. 8 Classification results with small sample sizes. a IP dataset, b PU dataset, c SA dataset

5 Conclusion

This work combines 2D–3D hybrid convolution with a pre-activated residual network and proposes PMCRNet for HSIC. Compared with traditional residual-based methods, PMCRNet effectively accelerates network training and reduces parameter computation. In addition, the long- and short-distance residual connections alleviate gradient vanishing in deep networks, facilitate back-propagation and better integrate the input information of each layer; this enhances feature reuse and information transfer and helps improve the ground-object classification accuracy of HSIs. Evaluations on three datasets and comparisons with seven related state-of-the-art models show that PMCRNet outperforms similar networks and achieves higher classification accuracy.