Introduction

Changes in the Earth’s environment are driven largely by the growth of the human population. Resource consumption is increasing rapidly as the population grows, and the Earth’s surface has undergone a variety of changes over the past 50 years as a result of human exploitation. The conversion of arable land to built-up areas and the spread of urbanisation change the pattern of Land Use and Land Cover (LULC). Land Cover (LC) describes the spatial variation of the physical surface of the Earth, such as vegetation, soil, and water, whereas Land Use (LU) describes the changes made to that surface by human activities or physical processes, such as deforestation, urbanisation, built-up areas, drought, and floods. LULC change analysis is an essential part of remote sensing; it relies on extracting valuable information, image processing, and classifying the spectral signatures of land cover.

The spatial–temporal analysis of large-scale landscapes through physical surveys is extremely difficult to carry out. Modelling techniques emerged as a replacement for the physical survey, providing a framework for understanding spatial patterns under various conditions. However, physical models rely heavily on prior knowledge of the model parameters, which contributes to their poor accuracy. An enormous effort has been made over the last few decades to automate LULC classification. Recent advances in remote sensing imagery allow for large-scale data analysis, image classification, processing, and prediction of future changes. Many modelling techniques have been used, such as dynamic, statistical, and Neural Network (NN) models, which provide realistic simulations that include spatial–temporal, economic, and social aspects (Yuan et al 2020). Machine Learning (ML) modelling can solve problems of classification, anomaly detection, and prediction in remote sensing images. Traditionally, images have been classified with ML algorithms such as maximum likelihood classifiers, the Markov chain model, k-nearest neighbour, Artificial Neural Network (ANN), and Support Vector Machine (SVM) (Aburas et al 2019). With the growing volume of Earth data and the advancement of ML modelling techniques, deep learning (DL) has emerged as a modelling technique that can handle enormous volumes of data and offers better predictive analysis of spatial–temporal aspects (LeCun et al 2015). DL models have outperformed traditional models in extracting multilevel spatial features from remote sensing images, which gives them high performance in image processing and classification (for example, image classification and object classification using a Convolutional Neural Network (CNN)) (Zhang et al 2016). Our ultimate goal is to develop a methodical procedure that incorporates DL methods and produces a reliable result for detecting LULC change.
The motivation of our study was to conduct an exhaustive survey of DL applications in remote sensing images, including LULC analysis. Through an exhaustive review, we have analysed the research papers of DL approaches in LULC and summarised the main scientific advances in the related work.

Some key findings and research gaps

The main purpose of this review article is to identify the gaps in traditional approaches and analyse the new opportunities in LULC classification. Although image classification in LULC using machine learning has made remarkable progress in recent years, certain issues still need further study. 1) The ML community has used various algorithms to classify images in LULC classification, but the volume of data is increasing tremendously, and new technologies with new datasets have introduced classification complexities that traditional machine learning cannot resolve. A variety of socioeconomic data is readily available, providing vital material for studying urban growth, and DL is capable of handling remote sensing data integrated with such socioeconomic data. 2) The feasibility and actual use of the methods for both LU and LC image classification have not been explored before, and land use features have yet to be resolved because of extremely high intra-class heterogeneity and inter-class similarity. 3) Most truth inference algorithms are domain-dependent, so there is scope for creating a domain-independent algorithm. 4) Achieving real-time or near-real-time LULC monitoring has become more complex due to changes in the components involved. 5) Forecasting urban land expansion is far more difficult than image analysis. Identifying the driving mechanisms of urban land cover change, including significant factors such as the economy, transportation, population, and growth, provides important insight into how human activities modify the urban environment. Therefore, setting a benchmark framework for ML models is a challenging task.

Software application for LULC analysis

In the precise study of LULC analysis, we have identified some software tools used in LULC analysis for pre-processing images, classification of images, analysing, and prediction analysis using spectral images. Table 1 contains a list of the software applications that have been identified and are as follows: Google Earth, Pro-ArcGIS, QGIS, ENVI, ERDAS IMAGINE, IDRISI, etc.

Table 1 Software applications used in remote sensing

The remainder of the paper is structured as follows. The “Background” section presents the background of this work and discusses remote sensing applications of ML and DL in LULC, together with the most widely used ML models and their merits and demerits in LULC. The “How DL approaches outperformed ML approaches in LULC classification” section describes how DL approaches improved on the performance of classic ML approaches, and presents the deep learning architectures and models that can carry out image understanding tasks for LULC classification. The “Deep Learning framework for LULC classification” section introduces the framework for LULC classification using DL approaches and cutting-edge techniques in remote sensing applications. The “Discussion” section emphasises the statistical analysis of LULC classification and provides a conclusion and an outlook on future work.

Scope of the study

In this article, we examine state-of-the-art approaches for LULC analysis with DL techniques. The main outline of the paper is as follows:

  • The ultimate goal of this article is to provide a roadmap for future trends in LULC analysis using DL techniques.

  • Discuss how DL approaches improve performance over the traditional ML approaches.

  • A detailed, comprehensive review of existing DL approaches in remote sensing.

  • A generic framework of LULC change analysis using DL.

  • Finally, we outline the statistical analysis of LULC and provide a conclusion with future research in LULC analysis.

According to the review analysis in Table 6, DL models achieved the best results in terms of classification or prediction.

Finally, some new perspectives on how the DL approaches can provide efficient work for LULC analysis and insights for future research are presented.

Background

Database for earth observation

Selecting a database from which to acquire images is the most crucial step in LULC analysis. Comprehensive LULC data repositories have been expanded to facilitate the implementation of many policies related to natural resources, food scarcity, deforestation, climate change, agriculture, etc. (Barker et al. 2020) (Xu et al 2018). Big Earth Observation (EO) datasets are used for LULC change analysis, and time-series satellite images provide a better understanding of agricultural expansion and deforestation over a particular time period (Petitjean et al 2013). Various datasets used by researchers in LULC analysis are shown in Table 2.

Table 2 Database and its sources

LULC classification in remote sensing using ML and DL models

LULC classification labels the pixels of remote sensing images to create classified images. LULC change analysis is divided into three steps: a) preprocessing, b) change detection, and c) accuracy assessment. Atmospheric corrections, multi-temporal radiometric corrections, topographic corrections, geometrical rectification, and image registration are addressed at the preprocessing step; these corrections are required to minimise the impact of such effects (Song and Woodcock 2003). It is important to evaluate the temporal dependence of the changes when collecting remote data for LULC (Lunetta et al 2004). Selecting the appropriate change detection method is the essential step, and several pixel-based and object-based classification techniques have been employed, giving a wide range to select from (van der Meer 2011). The pixel-based approach classifies each pixel individually, without considering its spatial context, based on the spectral reflectance of a particular LULC category. On medium-resolution imagery it has limited accuracy, which leads to noisy output and large interclass variance (McRoberts 2014). Numerous alternatives have been proposed to overcome the drawbacks of the pixel-based technique. Over the last decade, object-based image classification has become popular in LULC; it identifies objects through physical properties (shape, spectra, and texture). It allows the extraction and segmentation of spatial features by integrating vector-based and raster-based processing. Image segmentation and extraction work on stacked multi-temporal images that include one or more spectral transforms, multi-spectral wavebands, and texture. Statistical approaches have been used to identify the changes in LULC (Hussain et al 2013a). Accuracy assessment is the concluding step, which measures the quality of remote sensing image classification in LULC.

The Kappa index is the most commonly used technique for assessing the correctness of image classification, whereas overall accuracy is used to validate the classified images (Fan et al 2008). Other statistical techniques are also used to test or validate the performance of a model, such as the fuzzy similarity measure (FSM). Receiver operating characteristic (ROC) analysis is used to assess how well a change detection simulation predicts change, and the average spatial deviation distance (ASDD) is used to evaluate the model’s performance (Almeida et al 2008) (Pal and Ghosh 2017).
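As an illustration of the two most common accuracy measures, the following minimal sketch computes overall accuracy and Cohen’s kappa from a confusion matrix. The function names and the three-class matrix are hypothetical and are not taken from any of the reviewed studies; rows are assumed to hold reference classes and columns predicted classes.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the row (reference) and column (predicted) totals.
    p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical confusion matrix for three classes (e.g. water, vegetation, built-up).
cm = np.array([[50, 3, 2],
               [4, 40, 6],
               [1, 5, 39]])

oa = overall_accuracy(cm)
kappa = cohen_kappa(cm)
print(f"overall accuracy = {oa:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than overall accuracy here because part of the observed agreement would be expected by chance given the class totals.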

Until now, most review studies have covered other areas, e.g., medical image recognition (Litjens et al 2017), prediction in autonomous vehicles (Miglani and Kumar 2019), and speech recognition (Hinton et al 2012). Although several review papers have been published on DL applications in remote sensing, covering image classification (Ma et al 2019) (Li et al 2018), data fusion (Liu et al 2018a), atmospheric aerosols (Di Noia 2018), etc., they ignored other areas of remote sensing such as LULC. Therefore, this study explores how DL is applied to analyse the changing pattern of LULC on the Earth’s surface. Given the rapid growth in the number of related publications, a comprehensive review is required to gain a thorough understanding of DL applications in LULC. Table 3 presents various ML/DL models used in remote sensing applications of LULC.

Table 3 Remote sensing applications of LULC using ML and DL models

An overview of ML model merits and demerits

The archive of remote sensing data is growing at an exponential rate, and the planned satellite launches are expected to keep this trend going (dlr.de, 2018). The remote sensing sector has quickly adopted machine learning for a variety of applications, and there is a continuing effort to build an automated system for mapping LULC. The majority of research so far has favoured supervised learning techniques, on the assumption that LULC change is more likely to happen in situations similar to those that produced previous occurrences. ML algorithms are generally classified into three groups: supervised, unsupervised, and reinforcement learning. The choice among them depends on the type of data and the requirements of the project. Supervised learning methods are used with labelled data to predict values: a value from a continuous set is predicted by regression, while the category from a discrete set is predicted by classification. The k-nearest neighbours (kNN) algorithm predicts a sample’s value or class from the sample’s nearest neighbours in the feature space. For regression, the prediction is the average of the k nearest neighbours’ values; for classification, it is the class that appears most often among them (Altman 1992). Parametric classifiers aim to characterise, for each class, the typical subspace of values or the distribution associated with that class. In contrast, SVM concentrates entirely on the training samples that lie closest, in the feature space, to the ideal boundary between two classes. The goal of SVM is to determine the boundary that maximises the distance, or margin, between the support vectors of the two classes while minimising the number of support vectors.
SVMs were first developed to determine a linear class boundary (i.e. a hyperplane) (Cortes and Vapnik 1995) (Pal and Foody 2012). One of the most basic and simple classifiers is the Decision Tree (DT). A DT recursively splits the input data. Its tree-like structure consists of nodes that represent the splits, branches that represent the paths between them, and leaves that represent the final target values. Classification trees have leaf values that indicate categories, whereas regression trees have leaf values that represent a continuous variable. A split can be based on whether the value in a given band exceeds or falls below a predetermined threshold (Pal and Mather 2003). A weakness of DT is that pruning the tree decreases its accuracy on the training data. To overcome the limitations of DT, a random forest (RF) classifier is used to assign the final class to each unknown sample (Belgiu and Drăguţ 2016). While a single tree may not be the ideal solution, combining many trees can approach a globally optimal solution and overcome the DT problem. Each tree is trained on a randomly selected subset of the training data, using a random subset of the variables. Reducing both the training data and the number of variables lowers the accuracy of each individual tree, but it also makes the trees less correlated, and the less correlated the trees, the more dependable the group is as a whole. The relative importance of each band can be estimated by comparing the performance of the trees. In RF, tree pruning is not required because of the existence of multiple trees.
Regression techniques such as linear and polynomial regression are widely used in other areas, but for classification problems, logistic regression (LR) and the naive Bayes (NB) classifier have long been widely used. The NB classifier computes conditional probabilities from prior probabilities, updating them as evidence is observed. LR applies the sigmoid function to normalise the predicted values into the probability of an event occurring, and comparing that probability with a predetermined threshold (typically 0.5) yields the predicted binary result (Ng and Jordan 2001).
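The k-nearest-neighbour rule described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance in the feature space; the function name `knn_predict`, the two-band reflectance values, and the class labels are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict for one sample x from its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from x to every training sample (an assumption).
    distances = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    neighbours = [y_train[i] for i in nearest]
    if task == "regression":
        # Regression: average of the neighbours' values.
        return float(np.mean(neighbours))
    # Classification: class with the highest number of appearances.
    return Counter(neighbours).most_common(1)[0][0]

# Hypothetical two-band reflectance samples (red, near-infrared).
X_train = [[0.05, 0.02], [0.06, 0.03], [0.04, 0.02],
           [0.05, 0.45], [0.06, 0.50], [0.04, 0.48]]
labels = ["water", "water", "water",
          "vegetation", "vegetation", "vegetation"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_predict(X_train, labels, [0.05, 0.44], k=3)
predicted_value = knn_predict(X_train, values, [0.05, 0.44], k=3,
                              task="regression")
print(predicted_class, predicted_value)
```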

Unsupervised learning methods are frequently used to identify the inherent properties and structure of unlabelled sample data. They have been used for dimensionality reduction, clustering, and anomaly detection. Principal component analysis (PCA) is a technique for generating uncorrelated variables from correlated data. PCA seeks to uncover the most fundamental characteristics of a dataset, or to construct new features that can represent the original dataset, thereby reducing the dimensionality of the dataset and improving generalisation while keeping information loss to a minimum (Jolliffe and Cadima 2016). The basic PCA technique can serve as a simple framework for developing more effective feature extraction techniques. It has been argued, however, that PCA may not be suitable for hyperspectral image (HSI) classification (Cheriyadat and Bruce 2003; Uddin et al 2021): because PCA relies on the global variance of the HSI, it may be unable to extract subtle information from some data distributions. K-means clustering is a widely used technique. It separates the dataset into K distinct, non-overlapping subgroups (clusters), with each data point belonging to exactly one cluster. It aims to make the data points within a cluster as similar as possible while keeping the clusters as distinct as feasible (Likas et al 2003). The Self-Organizing Map (SOM) is a non-linear clustering algorithm that has been used on spatial and non-spatial data. It is a neural network with no hidden layers, in which each neuron of the output layer is assigned an n-dimensional weight vector matching the n-dimensional input feature vectors. An input feature vector is first compared with the neurons using a similarity measure to find the most similar neuron, and then the weights of that neuron and of its neighbours are adjusted towards the input vector. This procedure is applied to each feature vector in the input set.
Lastly, it organises the neurons spatially in a one-, two-, or three-dimensional grid in which dissimilar units lie further apart. Whereas K-means uses the nearest-neighbour distance, SOM employs the distances between all connected neurons (Kohonen 2012). Table 4 summarises the merits and demerits of machine learning models in LULC.
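To make the idea of generating uncorrelated variables from correlated data concrete, the following minimal sketch performs PCA through a singular value decomposition. The function name `pca` and the synthetic three-band data are assumptions made for illustration only.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centred @ components.T
    explained_ratio = (S ** 2) / (S ** 2).sum()
    return scores, components, explained_ratio[:n_components]

# Synthetic data: three "bands", two of which are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0],
                     2.0 * base[:, 0] + 0.1 * base[:, 1],
                     base[:, 1]])

scores, components, ratio = pca(X, n_components=2)
print("explained variance ratio:", np.round(ratio, 3))
```

The projected variables in `scores` are uncorrelated, and because the third band in this synthetic example is fully determined by the other two, two components retain essentially all of the variance.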

Table 4 Merits and demerits of ML in LULC

How DL approaches outperformed ML approaches in LULC classification

Deep learning image classification

The pixel-based classification task involves semantic segmentation, which assigns a class to each individual pixel in an image (for example, road, grass, or built-up area). The objective of pixel-based classification is to group the pixels of the image that correspond to specific perceptual objects in that image, hence providing context for the pixels. Because of the high spectral similarity across classes and the heterogeneity within classes, the pixel-based technique does not always provide a desirable outcome. Traditional schemes classify remote sensing images pixel by pixel and assume that each pixel carries a pure label of one of the natural targets. On the other hand, object-based classification is a newer paradigm for segmenting remotely sensed images that outperforms pixel-based classification. In object-based classification, the spectral information of each object is aggregated, and textural and contextual information is gathered for classifying the image (Hussain et al 2013b). In the remote sensing domain, new DL models have gained significance over older models, and DL approaches outperform almost all other remote sensing techniques in a wide range of applications.

Deep neural network architecture in LULC

Deep neural network architectures such as VGGNet, GoogleNet, AlexNet, ResNet, and DenseNet have attained tremendous popularity in image classification and semantic segmentation. Because of their feature extraction capabilities, these architectures are often used for image classification; they are listed in Table 5.

Table 5 Deep neural network architecture in LULC

AlexNet

(Krizhevsky et al 2012) proposed AlexNet, which is regarded as the first deep CNN architecture for image classification and recognition tasks. The learning capacity of AlexNet was increased by applying different parameter optimisation strategies. For diverse categories of image datasets, the depth of the network was increased from 5 to 8 layers, which improves the resolution of images. To improve performance and alleviate the vanishing gradient problem, the ReLU activation function was employed. To increase generalisation and avoid over-fitting, overlapping subsampling and local response normalisation were also used.

ZfNet

(Zeiler and Fergus 2014) proposed a multi-layer deconvolutional neural network, known as ZfNet. It was created to analyse network performance statistically. ZfNet demonstrated that only a limited number of neurons are active: in the first layer some of the neurons are dormant, while in the second layer the filter size and stride were reduced to retain the optimum number of features. This led to an improved CNN topology with enhanced performance.

VGGNet

(Simonyan et al 2014) suggested a simple and comprehensive design paradigm for CNN architectures that reduced the number of parameters and resulted in a 19-layer deep and 3 × 3 filter architecture with the added benefit of low computing complexity. It achieves superior outcomes when used to solve image classification and localisation challenges.

GoogleNet

(Ioffe and Szegedy 2015) proposed an architecture called Inception-V1 (GoogleNet), which was designed with the primary purpose of providing high accuracy at a minimal computing cost. In GoogleNet, the conventional convolutional layers were replaced by small blocks of neural network layers. These blocks contain filters of different sizes (1 × 1, 3 × 3, 5 × 5) to gather spatial information, and sparse connections are used to avoid redundant information and to discard feature maps that are not important. In addition, rather than employing a fully connected layer as the final layer, global average pooling was employed to decrease the connection density.

ResNet

(He et al 2016) developed the notion of residual learning in CNN, a highly effective technique for deep network training. The computational complexity of ResNet is lower than that of prior proposed networks. ResNet required less computational time and its depth is 20 and 8 times that of AlexNet and VGG, respectively. ResNet excels at image identification and localisation problems. To visualize the recognition task, spatial depth has been demonstrated in ResNet.
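The notion of residual learning can be illustrated with a small sketch: a block learns a residual function F(x) and adds its input back through an identity shortcut, so the block outputs ReLU(F(x) + x). For brevity this sketch uses fully connected weights `W1` and `W2` rather than convolutions; that simplification and all names are assumptions made for illustration, not the published ResNet configuration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Return ReLU(F(x) + x), where F(x) = W2 @ ReLU(W1 @ x)."""
    residual = W2 @ relu(W1 @ x)   # the learned residual F(x)
    return relu(residual + x)      # identity shortcut adds the input back

x = np.array([1.0, -2.0, 3.0])

# With zero weights the residual vanishes and the block passes ReLU(x)
# through the shortcut, which is why very deep stacks remain trainable.
zero = np.zeros((3, 3))
identity_output = residual_block(x, zero, zero)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 3))
W2 = rng.normal(scale=0.1, size=(3, 3))
output = residual_block(x, W1, W2)
print(identity_output, output)
```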

DenseNet

(Huang et al 2017) presented a solution to the problem of vanishing gradients. DenseNet overcame this issue by re-purposing cross-layer connectivity. It connected each preceding layer to the subsequent layer in a feed-forward fashion; hence, as specified in Eqs. 1 and 2, the feature-maps of all preceding layers were used as inputs to all successive layers.

$${Fm}_2^k=f_c(I_c,k_1)$$
(1)
$${Fm}_l^k=f_k(I_k,..,{Fm}_{l-1}^k)$$
(2)

where \({Fm}_{2}^{k}\) and \({Fm}_{l}^{k}\) are the resulting feature maps of the 1st and (l − 1)th layers, respectively, and \(f_k\) is a function that enables the cross-layer connection by concatenating the information from the preceding layers before passing it to the new transformation layer l. For this reason, the network gains the ability to explicitly distinguish between the pieces of information that are contributed to the network.
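As a rough illustration of this dense connectivity, the sketch below builds a small block in which every layer receives the channel-wise concatenation of all preceding feature maps. Each layer is reduced to a 1 × 1 convolution followed by ReLU, and the names `dense_block` and `growth_rate` as well as the random weights are assumptions made for this example rather than the published DenseNet configuration.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """x has shape H x W x C; each layer adds `growth_rate` new channels."""
    features = [x]
    for _ in range(num_layers):
        # Input to this layer: feature maps of ALL preceding layers.
        layer_input = np.concatenate(features, axis=-1)
        in_channels = layer_input.shape[-1]
        weights = rng.normal(scale=0.1, size=(in_channels, growth_rate))
        # A 1 x 1 convolution is a per-pixel matrix product over channels.
        new_maps = np.maximum(layer_input @ weights, 0.0)
        features.append(new_maps)
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3))
out = dense_block(x, num_layers=3, growth_rate=2, rng=rng)
print(out.shape)  # channels grow from 3 to 3 + 3 * 2
```

Because earlier feature maps are carried forward unchanged, the first three channels of the output are exactly the input.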

Convolutional block attention module

(Woo et al 2018) proposed an attention-based CNN component termed the Convolutional Block Attention Module (CBAM). CBAM combines average-pooling and max-pooling operations, resulting in a robust spatial attention map. The authors demonstrated that max-pooling can reveal information on distinctive object properties, whereas global average pooling can infer feature-map attention. The refined feature maps improve the representational capacity of the network. Owing to the module’s simplicity, it can easily be integrated into any CNN design.

CapsuleNet

(Arun et al 2019) proposed a technique built around a specific type of neuron called a capsule, which can detect a face as well as other related information. Many capsules combine to form a capsule network, called CapsuleNet, which has three layers of capsule nodes in the encoding part. The input is a 28 × 28 image, which is convolved with 256 filters of size 9 × 9 with stride 1. The result is given to the first capsule layer, which produces vector outputs rather than scalar outputs. CapsuleNet then accumulates the weighted features of the preceding layer, which is important in detection and segmentation processes.

HRNetV2

(Wang et al 2020) proposed an architecture that maintains high-resolution representations for vision tasks. HRNet has two main features. First, the high-to-low resolution convolution streams are connected in parallel. Second, information is exchanged repeatedly across resolutions. The benefit is a representation that is more precise spatially and richer semantically.

DL approaches outperformed ML approaches in LULC classification

Table 6 highlights many examples of DL algorithms for simulating LULC that performed better in image classification, object recognition, semantic analysis, and image segmentation. Multidimensional analysis in LULC classification may be important to keep up with the expanding volume and accessibility of remote sensing data. Current studies of remote sensing applications evaluate DL approaches on a variety of datasets with high spatial resolution and a large number of parameters, and report a higher degree of accuracy than ML models.

Table 6 DL approaches for simulating the remote sensing applications

Deep learning framework for LULC classification

Deep learning for remote sensing is being actively studied and has a lot of potential. Significant improvements in DL performance were frequently reported between 2016 and 2021; Fig. 7 illustrates the growth in published journal articles on DL in remote sensing. The basic framework for LULC modelling with a DL model, shown in Fig. 1, performs the modelling automatically, since no human intervention is required to model future LULC. DL models offer a wide variety of advantages when it comes to learning hierarchical features. LULC categories are primarily expansions or abstractions of the current terrain or landscape. Traditional ML models have been replaced by DL models because the latter outperform standard models in terms of performance, interpretability, data interpretation, and processing.

Fig. 1 An overall framework of DL model in LULC

The overall DL framework divides the problem into different modules. 1) Data acquisition: Selecting an appropriate dataset is the most critical step in LULC analysis, because high-quality data is necessary for generating a precise result when simulating LULC. In general, the most relevant data for analysing land-use change are physical, statistical, dynamical, and spatiotemporal data. The HSI/MSI data include aerial images, satellite images, ancillary data, Google Maps, topographical maps, and urban planning and land use maps. 2) Preprocessing the dataset: The preprocessing stage contains sub-tasks such as feature engineering and classifier training, in which the input data is prepared by denoising, eliminating irrelevant information, synchronising and fusing data, reducing dimensionality, re-sampling images, clipping vector and raster images, buffering, and geo-referencing. 3) Model training: Once high-quality training data has been obtained, it is used to train a DL model through feature extraction. 4) Validation and evaluation: To ensure that the trained model is accurate, the model is evaluated and updated as needed. 5) Labelled sub-images and post-processing: After the sub-images have been labelled, post-classification processing eliminates noise, corrects misclassifications, and improves overall accuracy. 6) LULC maps: The predicted LULC maps can assist urban planners and land resource managers in taking appropriate action on land cover.

In this section, we discuss the three major frameworks most commonly used for LULC classification in remote sensing: the Convolutional Neural Network (CNN), the Fully Convolutional Network (FCN), and the Autoencoder (AE).

Convolutional neural network

Among deep learning methodologies, the convolutional neural network (CNN) is the most effective and powerful framework. CNNs have been frequently utilised to classify remote sensing data because of their ability to classify images with complicated context. These techniques are usually not needed to complete an output image prediction. CNNs are feed-forward neural networks that exploit spatially local correlation by imposing a local connection pattern between neurons in neighbouring layers of the network. Their structure comprises several convolutional layers, max-pooling layers, and fully connected layers (Zhang et al 2017). Each convolutional layer computes a weighted sum of the preceding feature map using a filter and then passes the result through an activation function to obtain its output. The kernel size is chosen so as to capture local correlations while maintaining invariance to location within the data array. The resultant feature map is generated with invariance down to the lowest feasible units. Finally, a fully connected neural network links the outputs of the convolution and pooling stages together into a cohesive unit (LeCun et al 2015). The convolution operation can be expressed as follows:

$${fn}_l^k(x,y)=\sum_{c}\sum_{a,b}i_c(a,b)\cdot e_l^k(s,t)$$
$${F}_{l}^{k}=\left[{fn}_{l}^{k}\left(\mathrm{1,1}\right),\dots , {fn}_{l}^{k}\left(x,y\right),\dots ,{fn}_{l}^{k}\left(X,Y\right)\right]$$

Once the features are extracted, a pooling (down-sampling) operation is applied to extract combinations of features that are insensitive to translational shifts and minor distortions.

$${P}_{l}^{k}={g}_{\mathrm{p}}\left({F}_{l}^{k}\right)$$

Here, \({P}_{l}^{k}\) denotes the pooled feature map of the lth layer for the kth input feature map \({F}_{l}^{k}\), and \({g}_{\mathrm{p}}\) denotes the pooling operation. Pooling formulations used in CNNs include max, average, L2, overlapping, and spatial pyramid pooling (He et al 2015). An activation function acts as a decision function for a convolved feature map and helps the network learn. Activation functions speed up learning and introduce non-linearity into the features. Activation functions such as ReLU, sigmoid, tanh, maxout, and Swish all serve to provide non-linearity, and some of them, such as ReLU, help to overcome the vanishing gradient problem.

$${t}_{l}^{k}={g}_{\mathrm{a}}\left({F}_{l}^{k}\right)$$

In the above equation, \({g}_{\mathrm{a}}\) denotes the activation function, \({F}_{l}^{k}\) denotes the convolution output, and \({t}_{l}^{k}\) denotes the transformed output (Nwankpa et al 2018).
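The three operations introduced in this section (convolution, activation, and pooling) can be chained in a short sketch. It assumes stride 1, no padding, a ReLU activation, and non-overlapping max pooling; like most DL libraries it slides the kernel without flipping it, and the function names and input sizes are hypothetical.

```python
import numpy as np

def conv2d(image, kernels):
    """image: H x W x C, kernels: K x kh x kw x C -> (H-kh+1) x (W-kw+1) x K."""
    H, W, _ = image.shape
    K, kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for k in range(K):
        for x in range(out.shape[0]):
            for y in range(out.shape[1]):
                patch = image[x:x + kh, y:y + kw, :]
                # Sum over channels and kernel positions, as in the
                # convolution equation.
                out[x, y, k] = np.sum(patch * kernels[k])
    return out

def relu(feature_map):
    """Activation g_a applied element-wise to the convolution output."""
    return np.maximum(feature_map, 0.0)

def max_pool(feature_map, size=2):
    """Pooling g_p: maximum over non-overlapping size x size windows."""
    H, W, K = feature_map.shape
    H2, W2 = H // size, W // size
    cropped = feature_map[:H2 * size, :W2 * size, :]
    return cropped.reshape(H2, size, W2, size, K).max(axis=(1, 3))

image = np.ones((6, 6, 3))          # hypothetical 6 x 6 RGB patch
kernels = np.ones((1, 3, 3, 3))     # one 3 x 3 kernel over 3 channels
F = conv2d(image, kernels)          # 4 x 4 x 1, every entry equals 27
P = max_pool(relu(F))               # 2 x 2 x 1
print(F.shape, P.shape)
```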

Training and optimisation are major design choices in a CNN that determine its performance and address the overfitting problem. As the volume of data increases, the challenges of training tend to grow as well. A model is challenged when an unseen or new dataset is introduced: a model that has overfitted the training data performs poorly on it. Overfitting can be addressed by dropout and batch normalisation. Dropout randomly deactivates a fraction of the nodes during each training iteration. The primary goals of batch normalisation are to enforce a zero mean and unit standard deviation on the activations of a given layer for each mini-batch, in order to increase overall accuracy, make the network more resistant to overfitting, and accelerate the convergence of gradient descent. Finally, the fully connected layers, which form the last part of the CNN model, connect every unit of one layer to every unit of the next in order to classify, as shown in Fig. 2. They collect information from the feature extraction stage and analyse the output of all previous layers. As a result, data classification is achieved by combining selected features in a nonlinear manner (Rawat and Wang 2017).
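A minimal sketch of the two regularisation mechanisms mentioned above is given below. It assumes training-mode batch normalisation with a fixed scale and shift (no running statistics) and the commonly used "inverted" dropout, which rescales the surviving activations; the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature of a mini-batch (rows = samples)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout(x, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    # Rescale so that the expected activation is unchanged.
    return x * mask / keep

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

normalised = batch_norm(batch)
dropped = dropout(np.ones((8, 8)), rate=0.5, rng=rng)
print(normalised.mean(axis=0).round(6), normalised.std(axis=0).round(3))
```

At inference time dropout is switched off (`training=False`), so the full network is used.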

Fig. 2

Spatial feature extraction by using CNN model

Fully convolutional neural network

The U-Net architecture (Ronneberger et al 2015) was first introduced for biomedical image segmentation, but it is currently used in a variety of remote sensing applications, where it produces promising results on high-resolution images (Wurm et al. 2019). The Fully Convolutional Neural Network (FCNN) is a widely used network for semantic image segmentation. Many segmentation approaches adopt an encoder-decoder framework in the FCNN, as shown in Fig. 3: the first part, the encoder, extracts features and encodes them into a condensed representation, and the second part, the decoder, up-samples that representation back to the spatial resolution of the input (Long et al 2015a). Combining the encoder and decoder with skip connections helps to prevent the loss of accuracy, as shown in Fig. 4 (Badrinarayanan et al 2017). The major operations involved in FCNN are:

  1) Convolution block: The current base networks are configured to accept inputs of size (H ∗ W ∗ n_channels); remote sensing images with three channels (RGB) provide Red, Green, and Blue. Each convolution layer has a fixed kernel size, and zero-padding is used to preserve the height and width of the input.

  2) Pooling: By discarding redundant values from the feature map, the pooling function reduces its spatial size.

  3) Concatenation: In this layer, the encoder output of the preceding layer is concatenated with the decoder output, which has been up-sampled to the dimensions (H ∗ W ∗ n_up); the concatenated output has the dimensions (H ∗ W ∗ (n_up + n_conv)).

  4) Up-sampling: This layer doubles the height and width of the feature map by repeating each pixel value across the newly created pixels.

  5) Transpose convolution: This layer applies the transpose of a convolution, swapping its input and output dimensions so that the output is larger than the input.

  6) Deconvolution: This layer performs the inverse of the convolution function: the forward pass of the deconvolutional layer equals the backward pass of the convolutional layer, and vice versa. Deconvolutions are used to drive the model to learn more accurate outputs.
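
The shape bookkeeping behind the operations listed above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: a hypothetical 8 × 8 RGB tile, a 3 × 3 averaging kernel shared by all channels, 2 × 2 max pooling and nearest-neighbour up-sampling. No weights are learned, and the helper names are not taken from any particular FCNN implementation.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2-D convolution with zero-padding (odd kernel, same output size)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))  # zero-padding
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Halve the height and width of an (H, W, C) tensor."""
    h, w, c = x.shape
    return x.reshape(h // size, size, w // size, size, c).max(axis=(1, 3))

def upsample_nearest(x, factor=2):
    """Double height and width by repeating each pixel value."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def concat_skip(encoder_features, decoder_features):
    """Skip connection: concatenate along the channel axis."""
    return np.concatenate([encoder_features, decoder_features], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))              # H x W x n_channels (RGB), hypothetical
kernel = np.full((3, 3), 1.0 / 9.0)        # illustrative fixed kernel

enc = np.stack([conv2d_same(image[..., c], kernel) for c in range(3)], axis=-1)
pooled = max_pool(enc)                     # (4, 4, 3): encoder down-sampling
up = upsample_nearest(pooled)              # (8, 8, 3): decoder up-sampling
merged = concat_skip(enc, up)              # (8, 8, 6): channels are added together
print(enc.shape, pooled.shape, up.shape, merged.shape)
```

Concatenation along the channel axis leaves the height and width unchanged and sums the channel counts of the two inputs, which is why the encoder and decoder features must first be brought to the same spatial resolution.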

Fig. 3

Spatial feature extraction by using FCNN model (Badrinarayanan et al 2017)

Fig. 4

Spatial feature extraction by SAE

Autoencoder

An autoencoder (AE) is a key deep learning approach for learning features in a hierarchical manner. Its architecture is composed of three layers: an input layer (encoded layer), a hidden layer, and a reconstruction layer (decoded layer). In comparison to the input and reconstruction layers, the hidden layer contains fewer units. The input and reconstruction layers have the same number of units, and a non-linear function is applied between each pair of layers, as shown in Fig. 4.

It maps an input x ∈ R^n to a hidden layer h ∈ R^h with a latent representation, where W is the weight matrix applied to the input, β is the bias vector of the hidden layer, and g(·) is the activation function.

$$h=g(Wx+\beta )$$

Following that, the latent representation h is mapped back to a reconstruction y ∈ R^n, where

$$y=g(\theta h+\gamma )$$

y denotes the output layer, θ denotes the weight matrix from the hidden layer to the output layer, and γ is the bias vector of the output layer. The objective of the training procedure is to minimise the reconstruction error j(x, y) between x and y. If the reconstruction error is smaller than a chosen threshold, the latent representation can be employed as a reduced set of features. Several AEs can be stacked to lower the error rate: the hidden layer of one AE is fed into the subsequent one, resulting in the stacked autoencoder (SAE). This arrangement gradually generates deep features, and each additional layer is trained using a greedy layer-wise technique. After each layer, a pooling process compresses the features of successively bigger input regions into smaller ones, which can aid in a variety of classification or clustering tasks (Shin et al 2013).
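
The two mappings above can be traced in a minimal NumPy sketch of a single, untrained autoencoder forward pass. The layer sizes, the random initialisation, the sigmoid activation and the use of mean squared error for j(x, y) are assumptions made for illustration; the reviewed studies may use different choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, beta, g=sigmoid):
    """Latent representation h = g(W x + beta)."""
    return g(W @ x + beta)

def decode(h, theta, gamma, g=sigmoid):
    """Reconstruction y = g(theta h + gamma)."""
    return g(theta @ h + gamma)

def reconstruction_error(x, y):
    """Mean squared error, one common choice for j(x, y)."""
    return float(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
n, n_hidden = 6, 3                      # hidden layer has fewer units than the input

W = rng.normal(scale=0.1, size=(n_hidden, n))      # input -> hidden weights
beta = np.zeros(n_hidden)                          # hidden bias
theta = rng.normal(scale=0.1, size=(n, n_hidden))  # hidden -> output weights
gamma = np.zeros(n)                                # output bias

x = rng.random(n)                       # hypothetical input vector (e.g. pixel spectrum)
h = encode(x, W, beta)
y = decode(h, theta, gamma)
err = reconstruction_error(x, y)
print(h.shape, y.shape, err)
```

Training would adjust W, β, θ and γ to reduce `err`; in a stacked autoencoder, the vector `h` produced by one trained layer would serve as the input `x` of the next.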

Discussion

Statistical analysis and meta-analysis

The reviewed LULC literature comprises research that uses DL techniques to classify land cover. A systematic literature search was conducted in the Scopus database to locate articles on DL-based image processing for LULC. The systematic review was carried out to analyse the related research papers and to achieve the objectives of our research, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (McInnes et al 2018) and the recommendations for systematic reviews of prediction models (CHARM model) (Moons et al 2014).

Search strategy

Scopus was used as the main database; to verify the validity and quality of the results, we limited the search to journal articles, conference papers and books. A title and keyword search in the Scopus database (search date: 13 January 2022) using the search query “remote sensing” AND “Deep Learning” identified 413 published articles, and a further 25 were identified from other databases. After eliminating some articles, several kinds of information, such as “application of remote sensing” and “ML and DL model used in LULC”, were retrieved from 56 relevant articles, which were obtained by using the search query “Deep Learning” AND “LULC” and which include journals, conferences and books. A flow diagram of the inclusion criteria is depicted in Fig. 5.

Fig. 5

Flow diagram of the search and inclusion factors using the PRISMA and CHARM models

Various inclusion and exclusion factors were applied to validate the studies based on the motivation of this paper. The exclusion factors are as follows:

  • Non-English language articles.

  • Articles that did not include a remote sensing dataset.

  • Articles whose full text was not provided by the publisher.

  • Studies without outcome measures.

The inclusion factors are as follows:

  • Articles focusing on sub-areas of remote sensing applications using ML and DL models.

  • Use cases between the years 2015–2021.

  • Peer-reviewed articles in journals, conferences and books.

A total of 438 studies were identified using Scopus and other databases; 9 were identified as duplicate studies, while 330 were determined to be irrelevant to this meta-analysis. The database contained the remaining 89 articles, which were further screened using qualitative and quantitative analysis. A final database of 56 articles was accepted in this meta-analysis.

A concise interpretation of the findings

The articles were identified by searching the Scopus database for “deep learning” and “remote sensing” (search date: January 13, 2022). Based on this query, we retrieved 438 publications from 2015 to 2021 using the Scopus database; Fig. 6 shows the frequency of publications per journal. The results were further filtered by refining the search by article title, keywords, and abstract. For the statistical analysis, we identified articles matching the “DL” and “LULC” queries, which were refined to create a database of the different DL models used in LULC; the number of publications increased over the period 2015–2021, as shown in Fig. 7.

Fig. 6

Identified journals with frequency of publications from 2015–2021

Fig. 7

Increase in the number of publications in the Scopus database (2015–2021)

As shown in Fig. 8, the number of publications of each type increased during the period 2015–2021. Most of the journal articles focus on remote sensing applications in various fields. As of 2021, the number of journal articles exceeds the number of conference papers, reviews, and notes, which reflects the growth of LULC research. This demonstrates that DL has a wide range of applications in remote sensing. Fig. 9 summarises the statistical analysis of the DL approach and shows the increasing frequency of articles from 2015 to 2021. The number of articles is predicted to increase further in upcoming years. The remote sensing community has shifted its interest to DL models in recent years, in light of their remarkable success in the majority of state-of-the-art approaches for a diverse range of applications.

Fig. 8

Distribution of publications (Conference Paper, Article, Review and Note) increases during the period (2015–2021)

Fig. 9

Predicted increase in the number of publications in upcoming years

As shown in Fig. 10, in LULC analysis using DL models, CNN was the model most often used for classification during the period 2018–2021, followed by AE, FCN, and RNN. This popularity is due to the unique qualities of CNN, which make it well suited to processing HSI/MSI remote sensing images with regularly ordered pixels. The CNN model is capable of obtaining high-level spatial characteristics, which are useful for various analysis tasks in remote sensing.

Fig. 10

Distribution of DL models used in LULC (2018–2021)

This paper summarises the analysis of remote sensing images in LULC using DL methods and shows graphically that the frequency of articles was highest in 2021. According to the analysis in Fig. 10, the CNN model is more popular than the other DL models. From studying the current techniques and the literature, we conclude that DL for LULC image classification is still at an early stage and that considerable scope remains.

Advantages and disadvantages of various DL model in LULC

In remote sensing applications, collecting a large number of labelled samples for the classes of interest is challenging and error-prone, and most DL models depend on a large number of labelled training samples to optimise the weights in each iteration. Such models therefore require a lot of training time. Novelli et al. (2017) have shown that a pre-trained model with fine-tuning provides better accuracy. However, many DL models do not generalise well because they accept no more than three input channels (RGB), which may not be ideal for LULC classification, as remote sensing images usually carry additional information (Fu et al 2018). As a result, these models need to be redesigned and rebuilt from scratch, which requires sufficient training data (Novelli et al. 2017). Table 7 discusses the advantages and disadvantages of various DL models in LULC.

Table 7 Advantages and disadvantages of DL model in LULC

In this section, we compare the quantitative results of three DL models (CNN, FCN and SAE) using spectral and spatial features, respectively, in terms of two metrics: overall accuracy (OA) and average accuracy (AA). The ISPRS, Indian Pines and Pavia University datasets have been used for the quantitative comparison. The classification results based on spectral features achieved the best performance, as given in Table 8, and a graphical representation of the classification accuracy is shown in Fig. 11.
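
For readers unfamiliar with the two metrics, the following sketch shows how OA and AA are commonly computed from a confusion matrix. The three-class matrix is purely hypothetical and does not correspond to any result in this review; rows are assumed to hold the reference (true) classes and columns the predicted classes.

```python
import numpy as np

def overall_accuracy(cm):
    """OA: correctly classified samples divided by all samples."""
    return np.trace(cm) / cm.sum()

def average_accuracy(cm):
    """AA: mean of the per-class accuracies (diagonal / row totals)."""
    per_class = np.diag(cm) / cm.sum(axis=1)
    return per_class.mean()

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50,  0,  0],
               [ 5, 15,  0],
               [ 5,  0, 25]])

oa = overall_accuracy(cm)
aa = average_accuracy(cm)
print(f"OA = {oa:.3f}, AA = {aa:.3f}")
```

OA is dominated by the largest classes, whereas AA weights every class equally, so the two values diverge when class sizes are imbalanced, as is typical for land cover datasets.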

Table 8 Classification results of different datasets using DL models in LULC
Fig. 11

Classification results of DL models in LULC

As seen in Fig. 11, the CNN, FCN and SAE models were applied to various datasets. The CNN-based classification model performs better than the other models on the Pavia University dataset. Statistically, most of the papers published on remote sensing applications also use the CNN model, as shown in Fig. 10. CNNs are among the most powerful DL models for image feature extraction. In comparison to typical shallow models, DL models built using CNNs can hierarchically extract more abstract semantic features from the input images. For scene segmentation of RS images, CNN models pre-trained on natural image datasets such as ImageNet (Deng et al 2009) have shown strong results (Chen et al 2014) (Firat et al 2014). To generate global feature representations for a specific application, deep features can be taken directly from the intermediate layers of a freely accessible CNN architecture, such as AlexNet (Krizhevsky et al 2012) (Simonyan et al 2014) and (Ioffe and Szegedy 2015). In (Hu et al 2015a), multi-scale CNN activations are used as feature extractors, while other coding functions are used as the feature encoding method. Fine-tuning is a valuable option when the new dataset is substantial but not large enough to fully train a new network. (Nogueira et al 2017) developed a strategy for fine-tuning specific high-level layers of GoogLeNet (Ioffe and Szegedy 2015) using the UC-Merced dataset (Xia et al. 2010) and achieved outstanding results. Although supervised deep learning approaches like CNNs and their variations can yield excellent image classification results, they have the drawback of relying on a large amount of labelled training data. Several feature-learning models have been successfully used in remote sensing and may be layered to create deep unsupervised models such as SAE, sparse coding, and RBMs (Zhang et al 2014). (Romero et al 2015) proposed the use of deep CNNs for RS image classification, using an unsupervised approach to provide sparse feature representations for training the network. The efficiency of DL-based RS classification approaches in solving real-world problems has been demonstrated in the previous section. Because of the growing availability of RS data and computing resources, fast progress of DL in remote sensing image categorisation is expected in the coming years.

Conclusion

LULC analysis is a rapidly emerging research area in remote sensing, with applications such as climate change, urban planning, disaster management, and ecological change. The study was motivated by the popularity of DL approaches in remote sensing for land cover prediction. Owing to their availability, HSI/MSI imagery and the Landsat dataset are the most frequently used for image-based classification in LULC. We have identified various datasets that will help researchers to analyse LULC change and time-series satellite images. We have also identified various remote sensing software applications for pre-processing, classification and prediction.

In this study, we employed state-of-the-art DL frameworks to explore the hierarchical characteristics of LU and LC categories and abstractions or generalizations of the actual terrain or landscape. We examined the performance of several of the most recent DL architectures that are extensively used for pixel-level labelling in a variety of remote sensing applications. Key findings and gaps have been identified in order to analyse new opportunities for approaches that outperform the traditional ones. According to the overall accuracy of DL models with different parameters, the DL models are superior to ML models in remote sensing applications. Furthermore, we have proposed an overall framework of the DL model as a solution to new challenges and discussed the most commonly used approaches in LULC analysis. This study was motivated by the rapid growth of DL approaches in LULC, which was systematically identified through statistical analysis using the Scopus database. The recommendations presented in this paper are intended to benefit researchers by providing a uniform approach for presenting the architectural setup and DL approach in future LULC analysis. We conclude that DL for LULC image classification is still at an early stage and that considerable scope remains for future work.