1 Introduction

Advanced spectrometers capture hyperspectral images (HSIs) with rich spectral and spatial characteristics. The continuous spectrum extends from visible light to the infrared, improving the discriminability of ground objects [1]. By categorizing each pixel, HSI has found widespread application in mineral exploration, precision agriculture, and environmental monitoring. HSI classification methods are broadly categorized into spectral, spatial, or hybrid exploitation of spatial and spectral information [2]. Since each ground object has a unique spectral signature, the spectrum-based approach reduces to a pattern-recognition problem in which spectral vectors are identified with a classifier [3]. However, external factors such as illumination, environment, and atmosphere influence the acquired spectral vectors and introduce noise, the so-called spectral variability, leading to substandard performance [4]. To smooth the spectral differences of ground objects, the spatial information described in [5] is often considered, and numerous techniques based on joint spectral-spatial information have been published [6,7,8,9,10,11,12].

Several handcrafted feature-based HSI classification approaches exist, including the k-nearest neighbor [13, 14], the Bayesian estimation approach [15], multinomial logistic regression [16, 17], and the support vector machine (SVM) [18,19,20,21]. These approaches cannot suppress noise and do not exploit joint spatial-spectral characteristics. The spatial variability of spectral information [22] and the extraction of the most discriminative and informative characteristics [23] remain substantial obstacles in HSI. Moreover, several band-reduction-based approaches, such as linear discriminant analysis (LDA) [24,25,26], independent component analysis (ICA) [27], and principal component analysis (PCA) [28, 29], fail to exploit the spatial correlation between pixels effectively. The use of deep convolutional neural networks, which can automatically extract high-dimensional spatial and spectral characteristics, has allowed researchers to overcome these obstacles.

Chen et al. extracted the spatial and spectral properties of the 3D HSI cube as 1D features using stacked autoencoders (SAEs) and a deep belief network (DBN) [30, 31], but at the expense of many spatial details. Classification performance was improved using a five-layer 1D-CNN to extract spectral information [32]. Before extracting spatial features from the HSI with a 2D-CNN, principal component analysis (PCA) was used in [33, 34] to reduce the dimensionality of the HSI. By flattening the features, a dual branch of 2D and 1D CNN layers allowed the joint exploitation of spatial and spectral characteristics [35]. In [36], a 3D convolutional neural network (CNN) model was used to improve classification accuracy using spatial and spectral information; still, the enormous number of trainable parameters caused the computation cost to skyrocket. Later works [37, 38] lowered the computational cost by combining 3D and 2D CNN layers.

As shown in [39], dilated convolution-guided feature filtering can help reduce the model's loss during training and validation. This strategy lowers spatial feature loss without diminishing the receptive field and can capture distant features that boost classification performance. In [40], the residual connection-based SSRN model was used to exploit spatial and spectral information, where a residual connection was added to each 3D layer, followed by batch normalization. However, due to the 3D layers and residual blocks, the computational cost was considerable. A multi-branch 3D CNN was used in [41] with an attention module for HSI object classification. However, more trainable parameters are needed when more 3D layers are used, which raises the computation cost.

Recently, a powerful deep learning architecture called the transformer network was introduced for processing sequential data [42] and has since been applied to natural image categorization. Transformer networks are well suited to analyzing sequential data because they employ self-attention, unlike CNNs and RNNs. This presents a novel approach that can be effectively utilized for HSI land-cover categorization. Self-attention is the central module of transformers and can capture global information by encoding position. Although such models address the long-term dependence of spectral properties, they lack spatial-spectral integration at the local level. In addition, local texture and positional information are lost because current transformer networks encode spatial features via flattening and linear projection. A 3D-Swin transformer-based technique was used in [43] to represent semantic-level images. The technique used numerous transformer blocks to improve performance but considerably increased the computational cost. Later, the authors of [44] used the SSFTT approach to reduce the computing cost of the 3D-Swin transformer-based technique by employing one 3D and two 2D layers followed by a transformer module, allowing extraction of global and local characteristics; yet, classification performance in several classes could still be improved. In [45], a fusion of convolution and transformer blocks was applied to improve classification. The method used parallel linear convolution blocks and transformer blocks to collect local and global information; the HSI cube is first turned into a sequence and then handed to the transformer block. This minimized the computational cost but at the expense of diminished performance.

To address the aforementioned issues in HSI classification, a novel deep learning model called HFTNet, based on a dual-block transformer, is developed in this paper. In the proposed architecture, the local spectral and spatial information of the HSI is extracted via a 3D convolution block and 2D convolution layers, while global high-level semantic features are extracted using the dual-block transformer network, yielding improved classification performance. The significant contributions of the proposed method are as follows.

  1. Initially, a 3D convolution layer is incorporated to focus on extracting spectral features, followed by network-in-network structured 2D convolution layers explicitly designed to capture and analyze spatial features.

  2. A transformer module is integrated to establish local and global correlations within the spectral and spatial features. The design facilitates a dual-pathway approach that enhances the representation of global semantic features and local pixel-level details.

  3. The semantic and pixel pathways are then integrated, strategically distributing self-attention information across both pathways. Furthermore, the transformer's computational cost is reduced by splitting the query between the local convolution block and the transformer module.

  4. Finally, the CNN network is fused with the dual-block transformer module. This fusion significantly enhances the overall classification accuracy. The efficacy and superiority of HFTNet are experimentally demonstrated through rigorous testing on four distinct datasets.

The rest of the paper is organized as follows.

Section 2 presents the proposed method, Sect. 3 describes the quantitative and visual results, and Sect. 4 concludes the paper.

2 The proposed model

The system flow of the proposed method is illustrated in Fig. 1. Let the hypercube of the HSI be \(I \in R^{M \times N \times B}\), where M and N represent the width and height and B is the total number of bands. Each pixel in I contains spatial and spectral features, and the one-hot encoding of the land-cover labels is given by \(H = \left\{ {h_{1} ,h_{2} , \ldots h_{C} } \right\}\), where C is the number of land-cover classes. An HSI provides several hundred contiguous bands containing rich spectral information; because of this high number of bands, the computation cost and redundancy increase. To overcome this problem, principal component analysis (PCA) is applied over the B bands while maintaining the spatial information. After PCA, the number of bands is D and the hypercube is represented by \(Y \in R^{M \times N \times D}\).
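
A minimal sketch of this band-reduction step is given below, assuming the hypercube is available as a NumPy array; the number of retained components (30, as used later in the experiments) and the use of whitening are illustrative choices rather than prescriptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """Reduce the spectral dimension of an HSI cube I (M x N x B) to D = n_components
    bands with PCA, leaving the spatial dimensions unchanged."""
    M, N, B = cube.shape
    pixels = cube.reshape(-1, B)                          # (M*N, B) pixel spectra
    reduced = PCA(n_components=n_components, whiten=True).fit_transform(pixels)
    return reduced.reshape(M, N, n_components)            # Y: M x N x D

# Example on a synthetic cube with 200 bands
Y = reduce_bands(np.random.rand(145, 145, 200))
print(Y.shape)  # (145, 145, 30)
```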

Fig. 1

The architecture of the proposed HFTNet

The proposed method extracts spectral and spatial features using 3D and 2D convolution layers. One 3D CNN layer and 2D CNN layers are added to extract the spectral and spatial features, respectively. We do not stack many 2D CNN layers because the labeled training data are limited, which could lead to overfitting. Therefore, CNN layers based on the depth-wise separable method [46] are utilized, which enhance performance and reduce the computation cost. The depth-wise 2D CNN filter applied per input channel is defined as

$$ Y_{h,w,b}^{*} = \sigma \left( {\sum {K_{i,j,b} \, Y_{h,w,b} + A_{i,j,b} } } \right), $$
(1)

where \(Y_{h,w,b}^{*}\) is the output feature map, \(K_{i,j,b}\) the convolution kernel, and \(A_{i,j,b}\) the bias.
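
A minimal PyTorch sketch of a depth-wise separable 2D convolution of the kind referred to above is shown below; the kernel size and channel counts are illustrative assumptions rather than the exact HFTNet configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depth-wise separable convolution: one k x k filter per input channel
    (groups=in_channels), followed by a 1 x 1 point-wise convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

# Example: a batch of 30-band patches of spatial size 15 x 15
x = torch.randn(8, 30, 15, 15)
print(DepthwiseSeparableConv2d(30, 64)(x).shape)  # torch.Size([8, 64, 15, 15])
```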

2.1 Vision transformer preliminaries

The transformer architecture was first introduced for natural language processing (NLP) [42] and was later extended, as the vision transformer (ViT), to other fields such as image classification, segmentation, object detection and image captioning. In a transformer, one sequence is transformed into another with the help of encoder and decoder modules. The ViT encoder takes an input image, represented as a sequence of tokens, and produces the output representation.

2.1.1 The self-attention encoder module

The self-attention mechanism computes a weighted projection of the input sequence by relating different positions within the same sequence [41]. The self-attention network comprises the encoder structure and a multi-layer perceptron (MLP) block, where each block uses a normalization layer with residual connections. An attention function maps a query and a set of key-value pairs to an output [47]. The weight assigned to each value is computed from a compatibility function between the query and the corresponding key, and the output is the weighted sum of the values. To learn different representations, three learnable weight matrices \(M_{q}\), \(M_{k}\) and \(M_{v}\) are created in advance, and the tokens are linearly mapped to three matrices containing the queries q, keys k, and values v. Finally, the attention score of each q and v is calculated using Softmax activation as shown in Eq. (2).

$$ {\text{SHA}} = {\text{Attention}}(q,k,v) = {\text{Soft}}\max \left( {\frac{{q \times k^{t} }}{{\sqrt {d_{K} } }}} \right) \times v $$
(2)

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. The encoder's multi-head self-attention over q, k and v is computed by concatenating the outputs of all heads as follows.

$$ {\text{MHSA}} = {\text{Concat}}({\text{SHA}}_{1} ,{\text{SHA}}_{2} , \ldots ,{\text{SHA}}_{h} )M^{0} $$
(3)

where h is the total number of heads and \(M^{0}\) is the learnable output projection matrix.

$$ {\text{SHA}}_{j} = {\text{Attention}}(qM_{j}^{q} ,kM_{j}^{k} ,vM_{j}^{v} ) $$
(4)

The projection of the parametric matrix is defined as follows.

$$ M_{j}^{q} \in R^{{D_{m} \times d_{K} }} ,\;M_{j}^{k} \in R^{{D_{m} \times d_{K} }} ,\;M_{j}^{v} \in R^{{D_{m} \times d_{K} }} \;{\text{and}}\;M^{0} \in R^{{(h \cdot d_{K} ) \times D_{m} }} . $$
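
A compact PyTorch sketch of the single-head attention of Eq. (2) and the multi-head self-attention of Eqs. (3)-(4) is given below; the embedding size and number of heads in the example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def single_head_attention(q, k, v):
    """Eq. (2): Softmax(q k^T / sqrt(d_K)) v."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

class MultiHeadSelfAttention(nn.Module):
    """Eqs. (3)-(4): h heads with learnable projections M^q, M^k, M^v and an
    output projection M^0 applied to the concatenated heads."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3)   # stacks the per-head M^q, M^k, M^v
        self.out = nn.Linear(dim, dim)          # output projection M^0

    def forward(self, x):                       # x: (batch, tokens, dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(b, n, self.heads, self.d_k).transpose(1, 2)
        heads = single_head_attention(split(q), split(k), split(v))
        return self.out(heads.transpose(1, 2).reshape(b, n, -1))

tokens = torch.randn(2, 50, 64)
print(MultiHeadSelfAttention(64)(tokens).shape)  # torch.Size([2, 50, 64])
```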

Once the weight matrices have been learned, the output is fed into the MLP block. The MLP consists of two fully connected layers with the nonlinear Gaussian error linear unit (GELU) activation between them. Adding layer normalization (LN) after the MLP layer prevents exploding gradients, mitigates vanishing-gradient issues, and expedites training. The stacked layers share an identical structure throughout the model. For instance, let \(F \in R^{m \times D}\) be the token features with dimension D and length m. Mathematically, each block can be defined as follows.

$$ B = \sigma (FP),\;\tilde{B} = s(B),\;A = \tilde{B}Q $$
(5)
$$ b_{0} = \left[ {X_{{{\text{class}}}} ;\;X_{p}^{1} E;\;X_{p}^{2} E;\; \ldots ;\;X_{p}^{N} E} \right] + E_{{{\text{POS}}}} ,\quad E \in R^{{(p^{2} \cdot C) \times D}} ,\;E_{{{\text{POS}}}} \in R^{{(N + 1) \times D}} $$
(6)
$$ b_{l}^{m} = {\text{MSA}}\left( {{\text{LN}}(b_{l - 1} )} \right) + b_{l - 1} ,\quad l = 1,2,3 \ldots L $$
(7)
$$ b_{l} = {\text{MLP}}\left( {{\text{LN}}(b_{l}^{m} )} \right) + b_{l}^{m} ,\quad l = 1,2,3 \ldots L $$
(8)
$$ y = {\text{LN}}(b_{L}^{0} ) $$
(9)

where F denotes the token features, P the channel-wise linear projection, \(\sigma\) an activation function, and s an identity mapping.
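
For concreteness, a minimal PyTorch sketch of one encoder block following Eqs. (7)-(8) is given below; the MLP expansion ratio and number of heads are assumptions, not the exact values used in HFTNet.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder layer following Eqs. (7)-(8): pre-norm multi-head
    self-attention and a GELU MLP, each with a residual connection."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, b_prev):                                    # (batch, tokens, dim)
        h = self.ln1(b_prev)
        b_m = self.msa(h, h, h, need_weights=False)[0] + b_prev   # Eq. (7)
        return self.mlp(self.ln2(b_m)) + b_m                      # Eq. (8)

x = torch.randn(2, 50, 64)
print(TransformerEncoderBlock(64)(x).shape)  # torch.Size([2, 50, 64])
```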

2.2 Proposed convolutional and transformer block

The features extracted from the 2D convolutional block are converted into 2D tokens using Eq. (10). These tokens serve as input to the dual block. Finally, the q, k and v vectors are generated by flattening the 2D features y using Eq. (11).

$$ {\text{Tokens}} = {\text{MaxPool}}\left( {{\text{ReLU}}\left( {{\text{Conv2D}}(y)} \right)} \right). $$
(10)
$$ y_{qkv} = {\text{Linear}}\left( {{\text{Flatten}}(y)} \right). $$
(11)
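
A small sketch of one possible reading of Eqs. (10)-(11) is given below; the convolution, pooling and embedding sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Eqs. (10)-(11): 2D features are turned into tokens via Conv2D + ReLU +
    MaxPool, then flattened and linearly projected to the q, k, v embedding."""
    def __init__(self, in_channels, embed_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.proj = nn.Linear(in_channels, embed_dim)   # Linear(Flatten(y))

    def forward(self, y):                               # y: (B, C, H, W)
        t = self.pool(torch.relu(self.conv(y)))         # Eq. (10)
        t = t.flatten(2).transpose(1, 2)                # (B, H'*W', C) tokens
        return self.proj(t)                             # y_qkv, Eq. (11)

y = torch.randn(2, 64, 14, 14)
print(Tokenizer(64, 96)(y).shape)  # torch.Size([2, 49, 96])
```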

The query vector q is split into two parts, \(q_{a} \in R^{N \times C/2}\) and \(q_{b} \in R^{N \times C/2}\). The vector \(q_{a}\) is passed to the transformer block and \(q_{b}\) to the convolutional block. By doing this, the computation cost of the transformer block is reduced by half owing to the reduced channel size. The convolution block contains several convolution layers with 3 × 3 kernels, stride 1 and zero padding, as shown in Fig. 2. Each convolution is followed by ReLU activation and local response normalization (LRN).

Fig. 2

The architecture of the convolution block

2.2.1 Local response normalization (LRN)

The LRN is a contrast-enhancement process for feature maps that reduces the saturation problem of deep CNNs. We use the ReLU activation function in the convolution block, which improves the neurons' learning capability even with few samples. The activity \({\text{Ne}}_{x,y}^{i}\) of a neuron at position (x, y) in the i-th channel is normalized across its n neighboring channels j to improve generalization. The LRN is calculated using the formula shown below.

$$ {\text{LRN}}_{x,y}^{i} = {\text{Ne}}_{x,y}^{i} \Bigg/ \left( {t + \alpha \sum\limits_{j = \max (0,\,i - n/2)}^{\min (N - 1,\,i + n/2)} {\left( {{\text{Ne}}_{x,y}^{j} } \right)^{2} } } \right)^{\beta } $$
(12)

where \(N\) is the total number of channels and \(t,\alpha ,n,\beta\) are hyper-parameters. Before \(q_{a}\) is passed to the transformer branch, it is processed using Eq. (13) and reshaped to match the dimensions of the convolution block, while \(q_{b}\) is reshaped to 2D using Eq. (14) and then fed to the convolution block.

$$ {\text{Attention}}(y) = {\text{Reshape}}\left( {{\text{Softmax}}\left( {\frac{{q_{a} \times k^{T} }}{{\sqrt {d_{K} } }}} \right) \times v} \right). $$
(13)
$$ {\text{Conv}}(y) = {\text{BatchNormalization}}\left( {{\text{Conv2D}}(q_{b} )} \right). $$
(14)

Finally, global and local features are obtained through the transformer and convolution blocks, respectively, and these features are concatenated to form a pooled feature vector as shown in Eq. (15).

$$ F = {\text{concat}}({\text{Attention}}(y),{\text{Conv}}(y)). $$
(15)
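
To make the dual pathway concrete, a PyTorch sketch of Eqs. (13)-(15) is given below; the key/value projections, the padding of 1 (chosen to keep the token count aligned for concatenation), and the LRN neighborhood size are assumptions rather than the exact HFTNet settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBlock(nn.Module):
    """Sketch of the dual pathway of Eqs. (13)-(15): the query is split along
    the channel axis, q_a drives self-attention and q_b a small convolution
    block (Conv + ReLU + LRN, then batch normalization), and the two outputs
    are concatenated."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.h, self.w = height, width
        half = channels // 2
        self.k_proj = nn.Linear(channels, half)   # assumed: match k, v to q_a's width
        self.v_proj = nn.Linear(channels, half)
        self.conv = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.LocalResponseNorm(size=5),
        )
        self.bn = nn.BatchNorm2d(half)

    def forward(self, q, k, v):                   # q, k, v: (B, N, C), N = h * w
        B, N, C = q.shape
        q_a, q_b = q.split(C // 2, dim=-1)
        # Transformer path, Eq. (13): attention with the half-channel query.
        k_h, v_h = self.k_proj(k), self.v_proj(v)
        attn = F.softmax(q_a @ k_h.transpose(-2, -1) / (C // 2) ** 0.5, dim=-1) @ v_h
        # Convolution path, Eq. (14): reshape q_b to 2D and convolve.
        q_b2d = q_b.transpose(1, 2).reshape(B, C // 2, self.h, self.w)
        conv = self.bn(self.conv(q_b2d)).flatten(2).transpose(1, 2)
        # Eq. (15): concatenate global (attention) and local (conv) features.
        return torch.cat([attn, conv], dim=-1)

q = k = v = torch.randn(2, 49, 64)
print(DualBlock(64, 7, 7)(q, k, v).shape)  # torch.Size([2, 49, 64])
```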

In the classical ViT, the query is passed directly to the MHA to capture the global correlation of the features, which makes the computation cost high. The proposed HFTNet divides the query vector into two parts to reduce the computational resources. The local and global correlations of the spatial and spectral features are captured through the convolutional and MHA blocks, respectively. Further, enhanced features are obtained by fusing the features acquired via the convolution and transformer modules. The working of the conventional transformer and the proposed dual-block transformer is shown in Fig. 3.

Fig. 3

Working illustration of conventional and proposed dual block transformer

The feature vector F, built from the feature set y, is passed to the Softmax function, which converts logits into probabilities [48]. The number of land-cover classes is set to k = {9, 16, 16, 13} for the UP, IP, SV and KSC datasets, respectively, and the class label is denoted by the variable L. A bias term \(w_{0} y_{0}\) is included to classify the land covers. The class probabilities are calculated using Eq. (16).

$$ P\left( {y = L|F^{(j)} } \right) = \frac{{e^{{F^{(j)} }} }}{{\sum\limits_{L = 0}^{k} {e^{{F^{(L)} }} } }} $$
(16)

where \(F = w_{0} y_{0} + w_{1} y_{1} + w_{2} y_{2} + \cdots + w_{k} y_{k}\).
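
A minimal sketch of this classification step is shown below; the feature dimension of 256 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear layer plus Softmax, Eq. (16): the fused feature vector is mapped
    to k class logits (the linear layer supplies the bias term) and normalized
    to class probabilities."""
    def __init__(self, feature_dim, num_classes):      # e.g. num_classes = 9 for UP
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)  # weights w_1..w_k, bias w_0

    def forward(self, feats):
        return torch.softmax(self.fc(feats), dim=-1)

probs = ClassificationHead(feature_dim=256, num_classes=9)(torch.randn(4, 256))
print(probs.sum(dim=-1))   # each row sums to 1
```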

The algorithm of the proposed method is shown below.

Algorithm 1: Proposed HFTNet Method


3 Result analysis

3.1 Dataset description

In the proposed study, we have implemented HFTNet on four benchmark datasets: the University of Pavia (UP), Indian Pines (IP), Salinas Valley (SV) and Kennedy Space Center (KSC). The UP dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. It has 115 continuous spectral bands with a spatial resolution of 1.3 m per pixel (mpp) and a height and width of 610 and 340 pixels, respectively. In the experiment, 103 bands are used after removing 12 noisy bands. The scene contains 42,776 labeled pixels grouped into nine land-cover categories. The IP dataset was collected over the Indian Pines test site in north-western Indiana by the AVIRIS sensor. The scene contains 16 types of land-cover objects with a spatial resolution of 20 mpp and a size of 145 × 145 pixels. The 20 water absorption bands (104–108, 150–163, and 220) are removed, and 200 bands are used for the experiment.

The scene of the SV dataset was also collected by the AVIRIS sensor over Salinas Valley, California; it has 224 spectral bands with a spatial resolution of 3.7 mpp. The 20 water-absorption bands (108–112, 154–167 and 224) are removed, and 204 bands of spatial size 512 × 217 pixels with 16 classes are utilized in our experiment. The KSC dataset was captured by the AVIRIS sensor over the Kennedy Space Center (KSC), Florida. A spatial size of 512 × 614 pixels with a spatial resolution of 20 m is used in the experiment. After removing 48 water-absorption and low signal-to-noise-ratio bands, 176 bands were adopted for the analysis. A detailed description of each dataset is provided in Table 1 [51, 52].
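
As an illustration, a snippet for loading one of these benchmarks is given below; the file names and dictionary keys follow the commonly distributed .mat versions of the Indian Pines scene and may differ from the copies actually used in this work.

```python
import numpy as np
from scipy.io import loadmat

# File names/keys assumed from the widely shared .mat distribution of Indian Pines;
# the "corrected" cube already has the 20 water-absorption bands removed (200 bands).
cube = loadmat('Indian_pines_corrected.mat')['indian_pines_corrected']  # 145 x 145 x 200
labels = loadmat('Indian_pines_gt.mat')['indian_pines_gt']              # 145 x 145

print(cube.shape, labels.shape)
print('labeled classes:', np.unique(labels[labels > 0]).size)           # 16
```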

Table 1 Details of the sample in each land cover with their ground truth and color map

3.2 Experimental setting and performance indicators

The proposed method is implemented in a Python environment on the Windows 10 operating system (OS) with 128 GB RAM and an NVIDIA GeForce TITAN X4000 with dual 8 GB GPUs. First, the bands of each dataset are reduced to 30 using PCA. After that, the model is trained for 100 epochs using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 64. For the UP, SV and KSC datasets, samples are randomly split and 5% are used for training. Because several classes of the IP dataset contain few samples, 10% of its samples are used for training.
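
A self-contained sketch of this training configuration is shown below, using a stand-in model and random data; only the optimizer, learning rate, batch size and epoch count are taken from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and random data; only the optimizer (Adam), learning rate
# (0.0001), batch size (64) and epoch count (100) come from the text.
model = nn.Sequential(nn.Flatten(), nn.Linear(30 * 15 * 15, 9))
loader = DataLoader(TensorDataset(torch.randn(320, 30, 15, 15),
                                  torch.randint(0, 9, (320,))),
                    batch_size=64, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for patches, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
```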

To evaluate the quantitative performance of the model, the overall accuracy (OA), average accuracy (AA), Kappa coefficient (Kc) and class-wise classification accuracy of each land cover are calculated from the confusion matrix \({\text{CM}}_{{{\text{tp}}}}\), where \({\text{CM}}_{{{\text{tp}}}}\) denotes the number of test pixels whose true label is t and predicted label is p. \({\text{CM}}_{{{\text{tp}}}}\) is defined as

$$ {\text{CM}}_{{{\text{tp}}}} = \sum\limits_{k = 1}^{K} {1\left( {y_{k} = t} \right)1\left( {y_{k}^{*} = p} \right)} $$
(17)

where \(K\) is the total number of test samples, \(y_{k}\) the true label, and \(y_{k}^{*}\) the predicted label.

The OA is the proportion of correctly predicted samples over all test samples and is formulated by the following equation.

$$ {\text{OA}} = \frac{1}{K}\sum\limits_{t = 1}^{T} {{\text{CM}}_{tt} } $$
(18)

The AA is the mean of the per-class accuracies and is defined as

$$ {\text{AA}} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\frac{{{\text{CM}}_{tt} }}{{\sum\limits_{p = 1}^{P} {{\text{CM}}_{{{\text{tp}}}} } }}} $$
(19)

The Kc measures the agreement between the ground-truth map and the final classification map after correcting for chance agreement.

$$ {\text{Kc}} = \frac{{\frac{1}{K}\sum\nolimits_{t} {{\text{CM}}_{tt} } - \frac{1}{{K^{2} }}\sum\nolimits_{t} {\left( {\sum\nolimits_{p} {{\text{CM}}_{{{\text{tp}}}} } } \right)\left( {\sum\nolimits_{p} {{\text{CM}}_{{{\text{pt}}}} } } \right)} }}{{1 - \frac{1}{{K^{2} }}\sum\nolimits_{t} {\left( {\sum\nolimits_{p} {{\text{CM}}_{{{\text{tp}}}} } } \right)\left( {\sum\nolimits_{p} {{\text{CM}}_{{{\text{pt}}}} } } \right)} }} $$
(20)
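
A minimal NumPy sketch computing these quantities from Eqs. (17)-(20) is given below; the toy labels are illustrative.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Build the confusion matrix (Eq. (17)) and derive OA, AA and the Kappa
    coefficient (Eqs. (18)-(20))."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    K = cm.sum()
    oa = np.trace(cm) / K                                   # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))              # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / K ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(classification_metrics(y_true, y_pred, num_classes=3))
```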

3.3 Comparative performance evaluation

To demonstrate the effectiveness of the proposed method, seven classical methods are selected, namely 1DCNN [32], 2DCNN [33], 3DCNN [36], HybridSN [37], SSRN [49], SSFTT [40] and MBDA [44]. For all methods, the experiments are conducted according to the settings and parameters mentioned in the corresponding articles. The 1DCNN consists of five weighted layers: input, convolution, max pooling, fully connected and classification layers. It contains 20 1D convolution kernels with an output size of 128. For the classification of LULC, a Softmax activation function was added on top of the 1D CNN. Following the conventional CNN architecture, 2DCNN is equipped with three convolutional layers of size 8, 16 and 32, each followed by a max-pooling layer, batch normalization and ReLU activation. The 3DCNN network consists of 3D convolutional layers followed by batch normalization and max-pooling layers. The sizes of the 3D convolution blocks are 8, 16, and 32, respectively, with filters of size 3 × 3 × 3. HybridSN consists of three 3D convolution layers of size 8, 16 and 32; after the 3D block, a 2D convolutional layer of size 64 was included in the model. Each 3D and 2D block contains filters of size 3 × 3 × 3 and 3 × 3, respectively. In SSRN, separate spectral and spatial blocks contain skip connections across four convolutional layers and two identity mappings, and a residual link is added in the spatial block after two consecutive 3D convolutional layers. The SSFTT network contains one 3D and one 2D convolution block and a transformer module. At the top, a Softmax layer is added for classification.

3.4 Quantitative results

The experimental results of HFTNet on the four datasets are presented in Tables 2, 3, 4, and 5. The performance measures OA, AA and Kappa, together with the accuracy of every class in each dataset, have been evaluated [50]. We can notice in Table 2 that 1DCNN performs poorly in all classes, while 2DCNN yields slightly improved results. However, the performance on a few classes remains suboptimal due to missing spectral information. The 3DCNN improves performance further, but its computation cost is high. The HybridSN method uses 3DCNN and 2DCNN to extract spectral and spatial features; it achieves the highest accuracy of 99.62% for the Metal Sheet class. The SSRN method exploits spatial and spectral features using 3D convolution layers with residual connections, and its classification accuracy is higher than that of HybridSN for several classes. The ViT-based model SSFTT performs best on the Asphalt and Meadows classes.

Table 2 Performance comparison on UP dataset
Table 3 Performance comparison on IP Dataset
Table 4 Performance comparison on SV Dataset
Table 5 Performance comparison on KSC dataset

However, the performance on the other classes needs further improvement. The computation cost of the MBDA method is high due to its multiscale 3D-CNN layers, but improvements in classification accuracy can be seen for the Meadows, Gravel and Bricks classes. The proposed method achieves the highest performance for the Meadows, Trees, Bare-soil, Bitumen, and Shadows classes with a lower computation cost, owing to the use of only one 3D-CNN layer and the dual-block vision transformer. Similarly, in Table 3, we can notice that the performance of HFTNet, SSRN and MBDA is 100% for the Hay-Windrowed class. For small-sample classes such as Alfalfa, Grass-Pasture-Mowed and Oats, the proposed model achieved the highest accuracies of 97.85%, 100% and 90.18%, respectively. For the remaining classes, HFTNet obtained performance very close to the other methods.

In Table 4, SSFTT, MBDA and HFTNet achieve nearly identical performance due to the large number of samples in each class, while HybridSN obtains the highest classification accuracy in some classes. For the KSC dataset, out of 13 classes, the proposed method achieved the highest classification accuracy in 10 classes.

The SSRN method obtained the highest performance in one class, and the MBDA method achieved the highest in three classes, as shown in Table 5. In summary, the proposed method works well on classes with small sample sizes, while for classes with large sample sizes HFTNet achieves performance comparable to the other methods. This confirms that adding the dual-block transformer enhances the feature selection process and improves the classification accuracy with less computation cost.

3.5 Effect of patch size on performance

In the proposed study, we conducted experiments with patch sizes of 9 × 9, 11 × 11, 13 × 13, 15 × 15 and 17 × 17. For the smallest patch size, 9 × 9, the model performance is poor on all datasets. However, classification performance improves as the patch size increases, and the maximum values of AA, OA and Kappa are obtained with a patch size of 15 × 15. Increasing the patch size further reduces the classification accuracy, as shown in Fig. 4.

Fig. 4

The effect of patch size on performance for the UP, IP, SV and KSC datasets is shown in a–d, respectively

3.6 Visual results

The classification maps of the different methods are shown in Figs. 5 and 6. In Fig. 5, we can notice that the classification maps of 1DCNN, 2DCNN, 3DCNN and SSRN are poor for the Meadows and Bitumen classes of the UP dataset. The 1DCNN and 2DCNN maps contain considerable noise; because of this, objects are not classified accurately in any of the datasets. In contrast, the maps of SSFTT, MBDA and the proposed HFTNet are very close to the ground-truth map. For the Metal Sheets class, HybridSN and HFTNet achieve similar visualization maps. On the IP dataset, the visual performance of 1DCNN, 2DCNN and 3DCNN is again inferior in several classes, as shown in Fig. 7. The visualization maps of 3DCNN, SSRN, MBDA and HFTNet for the Grass-Free land-cover class are similar to the ground-truth map, since these methods can suppress the noise. Furthermore, for the Alfalfa, Grass-Pasture-Mowed and Oats classes, the HFTNet visualization map is closer to the ground truth than those of the other methods.

Fig. 5

Classification map visualization of different methods on IP dataset. a False color image b Ground truth map c 1DCNN d 2DCNN e 3DCNN f HybridSN g SSRN h SSFTT i MBDA and j HFTNet

Fig. 6

Classification map visualization of different methods on UP dataset. a False color image b Ground truth map c 1DCNN d 2DCNN e 3DCNN f HybridSN g SSRN h SSFTT i MBDA and j HFTNet

Fig. 7

Classification map visualization of different methods on SV dataset. a False color image b Ground truth map c 1DCNN d 2DCNN e 3DCNN f HybridSN g SSRN h SSFTT i MBDA and j HFTNet

In Fig. 8, we can see that the visual representations of classes one, two, three and four produced by the proposed method are very similar to the ground-truth map, owing to the global semantic and local spatial-spectral features. However, for class five, 3DCNN and HybridSN obtain slightly better maps. For classes six and eight, MBDA and HFTNet achieve the best visualization maps. For the other classes, the classification map of the proposed method is very close to the ground truth. In the KSC dataset, the visualization map of class one produced by HFTNet is much better than those of the other methods. For classes two and three, MBDA achieves the best visual map. For the remaining classes, the classification map of the proposed method is close to the ground-truth map. In short, the visualization maps of HFTNet on the UP, IP, SV and KSC datasets are much better than those of the other methods in several classes, owing to the improved spatial and spectral features obtained through the convolutional and transformer blocks.

Fig. 8

Classification map visualization of different methods on KSC dataset. a False color image b Ground truth map c 1DCNN d 2DCNN e 3DCNN f HybridSN g SSRN h SSFTT i MBDA and j HFTNet

3.7 Training and validation time comparison

We investigated the training and test times on the four datasets under the same experimental settings. As shown in Table 6, the training and validation times of 3DCNN [36] and MBDA [44] are relatively high. The 1DCNN [32] and 2DCNN [33] take less computation time, and SSFTT [40] and HFTNet take the least training and test time; however, the training and test time of SSFTT is still higher than that of HFTNet. In SSFTT, the query vector is passed directly to the MHA, whereas in HFTNet the query is divided into two parts that are passed to the convolution and transformer blocks. This confirms that HFTNet can be used for real-time hyperspectral image processing.

Table 6 Training time (m) and test time (s) of HFTNet on four datasets

3.8 Industrial applications of the proposed method

HFTNet can be used for automated quality control in the manufacturing industry. Because it can analyze spatial and spectral properties in depth, it is well suited to identifying minor flaws or irregularities in products ranging from electronics to automobile parts. Further, it can be used for the early detection of cancer and several other severe diseases: the high-dimensional features extracted by the model can help with anomaly detection, early diagnosis, and predictive analytics for patient care. In addition, HFTNet can analyze satellite and aerial imagery in agriculture to track crop health, forecast yields, and identify plant diseases because of its ability to comprehend the images' local and global features. The proposed approach can also be used to analyze geographic and environmental data for environmental applications, including tracking land-use changes, monitoring deforestation, and evaluating the condition of natural ecosystems. Other applications include security and surveillance: the sophisticated feature extraction can improve surveillance systems, and it can also be used for facial recognition and crowd analysis to detect security risks.

4 Conclusion

Hyperspectral images (HSIs) contain rich sets of spatial and spectral information. Traditional CNN-based methods improved HSI classification performance but lacked global semantic and local features; in addition, their computational cost increases significantly because of the large number of trainable parameters. In the proposed study, an HFTNet based on convolution and transformer blocks is proposed that extracts local features from the convolution block and global semantic features from the transformer block. Finally, the features of both blocks are combined, and classification is performed using the Softmax layer. The computational resources are reduced by dividing the query into two parts and passing them through the two modules.

Further, the quantitative and visual performances obtained on the four datasets are much better than those of state-of-the-art methods. The effectiveness of the proposed method is tested on four datasets, achieving accuracies of 99.34% (UP), 97.95% (IP), 99.70% (SV), and 84.23% (KSC). The computation cost of the proposed model is lower due to the reduction of the query in the transformer. The high performance of the proposed method makes it suitable for several industrial applications.

The computational resource requirements of HFTNet still need to be reduced, as it involves 3D convolution layers and transformer modules. In addition, the model performance has been evaluated on open-access datasets, and its effectiveness depends on the quality and volume of the data; the network's performance must be evaluated when data are scarce, noisy, or of poor quality. The sophisticated architecture of HFTNet is beneficial for feature extraction, but it can also increase the risk of overfitting, especially when dealing with limited or particular datasets. Further, the model needs to be tested on diverse real-time datasets.

In future work, we will optimize the algorithm to reduce computational requirements, making HFTNet more accessible and efficient for various applications. In addition, the model architecture can be refined to improve its ability to handle diverse and limited datasets effectively through advanced data augmentation techniques or transfer learning. Further, research can be directed toward enhancing the real-time processing capabilities of HFTNet, making it more suitable for applications in dynamic environments. We also see potential in integrating HFTNet with other emerging technologies, such as edge computing and IoT devices, to expand its applicability in real-world scenarios.