Introduction

Forests are vital in combating climate change, storing around 80% of terrestrial carbon (Liu et al., 2017). The carbon cycle and above-ground biomass (AGB) have been prioritized among the key biodiversity metrics to be monitored through satellite-based observations (Reddy et al., 2023). Accurate, spatially explicit AGB measurement supports initiatives such as Reducing Emissions from Deforestation and Forest Degradation (REDD+) and informs forest management plans by reducing uncertainty in carbon stock assessments (Kaasalainen et al., 2015). The AGB of forests is typically assessed through conventional field measurements or remote sensing techniques (Sainuddin et al., 2023b; West, 2015). For small forest stands, accurate AGB calculations are best achieved through direct field measurements (Lu, 2006), but applying this method at a regional scale is impractical because of its high cost, labour intensity, and time demands (Lu, 2006; Henry, 2011).

Previous studies (Reddy et al., 2016; Saatchi et al., 2011) have demonstrated the effectiveness of remote sensing in quantifying and monitoring forest biomass at the regional level. Consequently, a range of remote sensors, both passive and active, have been employed to estimate AGB. Estimating AGB from earth observation data requires allometric equations together with satellite-derived structural or biophysical metrics (Boisvenue & White, 2019). Nonetheless, using earth observation data to estimate AGB presents difficulties, such as choosing appropriate models and dealing with constraints on data availability (Lu, 2006). Optical remote sensing data such as Landsat is frequently used because of its accessibility, extensive temporal coverage, and moderate spatial resolution (Dogru et al., 2020). Sentinel-2, part of the EU Copernicus program, offers improved forest monitoring in tropical regions through additional spectral bands that enhance AGB estimation (Li et al., 2021; Mutanga et al., 2012). However, optical sensors face limitations, such as difficulty in penetrating dense canopies, susceptibility to cloud cover, and data saturation in areas with dense canopy cover (Lu et al., 2012; Powell et al., 2010). Like Landsat-8, Sentinel-2 is less effective at estimating higher biomass levels. Biomass saturation is a known problem with low- to medium-spatial-resolution multispectral data (Steininger, 2000). Synthetic aperture radar (SAR) has demonstrated greater efficiency in assessing medium- to high-stand-level biomass. Because tropical areas are frequently cloud-covered, SAR has proven to be a valuable instrument for evaluating AGB there (Lu, 2006; Lu et al., 2016). SAR data offers the advantage of being collected in any weather and at any time of the day or night.
Its capabilities include seeing through clouds and thick forest covers while also detecting variations in surface texture, dielectric properties, and water content. SAR can offer detailed insights into forest composition depending on the microwave bands (X-, C-, L-, and P-bands) utilized. Co-polarized and cross-polarized SAR data offer unique insights into the orientation and structural characteristics of forest canopies and tree stems, providing valuable information from the backscattered data (Ulaby et al., 1990a). Even though SAR systems don’t extract the vertical composition of vegetation as adeptly as airborne LiDAR, their wide orbital swath makes them advantageous for regional biomass monitoring.

There are three main approaches for estimating forest biophysical parameters. First, empirical data-driven relationships use ground measurements to predict variables through statistical regression, but they are limited by ground measurement quality and regional specificity (Fuchs et al., 2009; Lu et al., 2012; Næsset et al., 2013; Skowronski et al., 2014; Tian et al., 2012). Second, physical models based on electromagnetic principles simulate a vegetation stand’s response to radiation interactions; because they simplify real-world phenomena, they require careful inversion (Ulaby et al., 1990b; Cartus et al., 2011, 2012; Santoro et al., 2011; Antropov et al., 2013; Sainuddin et al., 2021, 2023a). Third, non-parametric machine learning (ML) models, such as random forest and gradient boosting, capture complex relationships without assuming a data distribution and integrate data from multiple sensors for better estimates (Behera et al., 2023; Breidenbach et al., 2012; Jung et al., 2013; McRoberts et al., 2012; Mitchard et al., 2013; Mutanga et al., 2012; Saatchi et al., 2009). Previous research (Kellndorfer et al., 2010; Walker et al., 2007) has shown that integrating data from multiple sensors yields more accurate biomass estimates than data from a single sensor. In fusing optical and radar data, numerous investigations (Li et al., 2020; Malhi et al., 2022) have combined multispectral bands, vegetation indices, and texture parameters from optical sensors with radar backscatter coefficients. Additionally, textures generated from satellite imagery are known for their robustness and adaptability, and many previous studies (Dang et al., 2019; Dong et al., 2020; Eckert, 2012; Kelsey & Neff, 2014) have confirmed the efficacy of these parameters in AGB assessment.

In this research, the AGB of tropical deciduous forests in the Purna regional forest landscape was estimated by integrating Sentinel-2 optical data with Sentinel-1 SAR data, together with topographical features from SRTM data and the GEDI canopy height product (Potapov et al., 2021). Three ML models, namely random forest (RF), extreme gradient boosting (XGB), and boosted regression tree (BRT), were applied in several modelling scenarios to evaluate their performance in predicting AGB. Their predictions were evaluated against field-measured data, offering insights into their effectiveness and accuracy.

Materials and Methods

Study Area

The selected study area is the Purna regional landscape, which includes the Purna Wildlife Sanctuary and its surroundings (20° 51′–21° 21′ N and 73° 32′–73° 48′ E) in the Dang district of Gujarat, India. The landscape spans around 324.88 km², with 252.36 km² of this area covered by forests, and represents the northern region of the Western Ghats (Reddy et al., 2015). It lies in the basins of the Purna and Gira rivers. The highest peak is Walu Dungar, rising to an altitude of 574 m. The area experiences a predominantly dry climate, with the Southwest Monsoon prevailing from June to September. Purna features both moist and dry deciduous forests (Champion & Seth, 1968). The dominant tree species in the study area include Tectona grandis, Wrightia tinctoria, Terminalia alata, Haldina cordifolia, Acacia catechu, Butea monosperma, Desmodium oojeinense, and Mitragyna parvifolia. The study area was outlined by generating a 2 km buffer extending from the boundaries of Purna Wildlife Sanctuary (Fig. 1).

Fig. 1 Location map of the study area showing distribution of sample plots on the false colour composite of Sentinel-2 imagery

Field Sampling and AGB Estimation

The forest area was stratified based on the forest-type map from Reddy et al. (2015). Field inventory data was collected between 2019 and 2020 across 106 distinct 0.1 ha sample plots spread throughout the study area. This ensures the representation of the diversity of biomass within different forest types. A sampling intensity equivalent to 0.1% of the total forest area was selected due to practical feasibility. Stratified random sampling was utilised to establish these plots, and their coordinates were recorded using a global positioning system (GPS). For each plot, parameters such as height, diameter at breast height (DBH), number of individuals, and species names were documented. The AGB was estimated using an allometric equation (Eq. 1) that incorporated tree height and Diameter at Breast Height (DBH), with distinct coefficients specific to dry and moist deciduous forests proposed by Chave et al. (2005). In the sampled plots, 75.47% were located in the dry deciduous forests, and 24.53% were found in the moist deciduous forests.

$$\ln(\mathrm{AGB}) = a + b\,\ln\left(\rho D^{2} H\right)$$
(1)

Here, ρ is the wood density of the tree as suggested by the Forest Research Institute (Chowdhury & Ghosh, 1958), D is the diameter at breast height in centimetres, H is the height of the tree in metres, and a and b are forest-type-specific coefficients. Table 1 presents the coefficients applied in the allometric equation for each forest type. Figure 2 depicts the frequency distribution of the field-measured AGB. Table 2 gives a statistical overview of the field-measured AGB (t/ha) from the sampled plots.
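As an illustration of how Eq. 1 turns tree measurements into a plot-level estimate, the following minimal Python sketch applies the equation tree by tree and scales the plot total to t/ha for a 0.1 ha plot. The coefficient values in `COEFFS` are placeholders in the style of the Chave et al. (2005) dry- and moist-forest models and are an assumption here; the values actually used are those in Table 1. The function names, the assumption that Eq. 1 returns kilograms per tree, and the wood density unit (g/cm³) are likewise illustrative.

```python
import math

# ASSUMPTION: placeholder (a, b) pairs for illustration only.
# Substitute the coefficients reported in Table 1 for each forest type.
COEFFS = {
    "dry": (-2.187, 0.916),
    "moist": (-2.977, 1.0),
}


def tree_agb_kg(wood_density, dbh_cm, height_m, forest_type):
    """Eq. 1 for one tree: ln(AGB) = a + b * ln(rho * D^2 * H).

    Assumes rho in g/cm^3, D in cm, H in m, and AGB returned in kg.
    """
    a, b = COEFFS[forest_type]
    return math.exp(a + b * math.log(wood_density * dbh_cm ** 2 * height_m))


def plot_agb_t_per_ha(trees, forest_type, plot_area_ha=0.1):
    """Sum tree-level AGB over a plot and express it in t/ha.

    `trees` is an iterable of (wood_density, dbh_cm, height_m) tuples.
    """
    total_kg = sum(tree_agb_kg(rho, d, h, forest_type) for rho, d, h in trees)
    return total_kg / 1000.0 / plot_area_ha


if __name__ == "__main__":
    # Two hypothetical trees in a dry deciduous plot.
    example_trees = [(0.6, 20.0, 15.0), (0.5, 30.0, 18.0)]
    print(round(plot_agb_t_per_ha(example_trees, "dry"), 2), "t/ha")
```

Keeping the coefficients in a lookup keyed by forest type mirrors the use of distinct coefficients for dry and moist deciduous plots.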

Table 1 Values for coefficients applied in the allometric equation

Fig. 2 Histogram showing field-measured AGB distribution

Table 2 Statistical overview of the field-measured AGB (t/ha) from the sampled plots

Satellite Data and Predictor Variables

Sentinel-1 Data

The Sentinel-1 program features two satellites: Sentinel-1A (S1A; launched on April 3, 2014) and Sentinel-1B (S1B; launched on April 25, 2016). The constellation is designed for short revisit times, broad coverage, and rapid data distribution. Sentinel-1 operates a C-band imager at 5.405 GHz, with an incidence angle ranging from 20° to 45°. The satellites maintain a Sun-synchronous, near-polar orbit at an altitude of 693 km. For this study, dual-polarization (VV + VH) Sentinel-1A Interferometric Wide (IW) swath Ground Range Detected (GRD) data, acquired on May 3, 2019, was used. The data was accessed freely from the ESA Copernicus hub (https://sentinel.esa.int/web/sentinel/sentinel-data-access). Preprocessing was conducted in the Sentinel Application Platform (SNAP) (version 8). Once the orbit file was applied, the SAR data underwent radiometric calibration and then thermal noise removal. The data was resampled to a pixel size of 30 m to match the size of the sampled field plots. To mitigate speckle noise in the image, a Gamma MAP filter with a 9 × 9 pixel window was employed.

Sentinel-2 Data

Sentinel-2 (S2A and S2B) carries a multispectral instrument (MSI) for advanced optical remote sensing. It offers 13 spectral bands with a 5-day revisit cycle. The bands are provided at three spatial resolutions: 10 m, covering the blue, green, red, and near-infrared (NIR) bands; 20 m, including three vegetation red-edge bands, a narrow NIR band, and two shortwave infrared (SWIR) bands; and 60 m, which captures the coastal aerosol, water vapour, and SWIR-cirrus bands. Data acquired on January 18, 2020 was obtained from the ESA Copernicus hub. Atmospheric correction was performed with the Sen2Cor tool in SNAP, and the data was then resampled to 30 m pixel spacing to align with the field plot dimensions. The data was then geocoded using the Shuttle Radar Topography Mission (SRTM) digital elevation model.

Predictor Variables

This study used Sentinel-1 SAR data as a key component of the analysis, with the VV and VH polarizations serving as predictor variables. Principal component analysis (PCA) was applied to the multispectral bands of Sentinel-2 data to reduce dimensionality while preserving the variability between them. The first two principal components, PC1 and PC2, accounted for 90% of the dataset variance and were selected for subsequent texture processing. The Gray-level Co-occurrence Matrix (GLCM) method (Haralick et al., 1973) was used, with eight GLCM elements calculated within a 3 × 3 processing window in the SNAP toolbox. Additional predictor variables include vegetation indices from Sentinel-2 data: the Green Normalized Difference Vegetation Index (GNDVI) (Gitelson & Merzlyak, 1998), the Green Red Vegetation Index (GRVI) (Tucker, 1979), and the Normalized Difference Red Edge Index (NDRE1) (Gitelson & Merzlyak, 1996). These indices were chosen based on a correlation test with field-measured AGB, in which GNDVI, GRVI, and NDRE1 emerged as the leading contributors; other indices were excluded to limit multicollinearity. The Leaf Area Index (LAI) was obtained through the biophysical processor available in the SNAP toolbox, which is based on the PROSAIL model (Jacquemoud et al., 2009), and served as an indicator of biophysical parameters. The assessment also integrated the global canopy height product (Potapov et al., 2021) and the elevation, slope, and aspect derived from SRTM data. All predictor variables were resampled to 30 m resolution to correspond with the field plots using the nearest-neighbour method within the resample function of the SNAP toolbox. To mitigate the impact of location inaccuracy, three neighbourhood statistics (minimum, maximum, and mean) were computed for each variable (Carreiras et al., 2013).
This approach resulted in one pixel value at each field plot centre, supplemented by three neighbourhood statistics for each plot, giving a total of four values per variable. Consequently, 112 predictor variables were available for modelling purposes. SAR polarizations, along with physical, spectral, biophysical, and texture parameters, were used in combinations as predictor variables within the models. For AGB estimation, four modelling scenarios were examined, each using a different variable combination: (i) Model 1, which estimated AGB using polarizations and physical variables (27 in total), (ii) Model 2, which estimated AGB by combining spectral and biophysical variables (40 in total), (iii) Model 3, which estimated AGB using only texture variables (68 in total), and (iv) Model 4, which estimated AGB by combining polarizations, physical, spectral, biophysical, and texture variables (112 in total). The choice of predictor variables was guided by findings from previous research, which suggest that integrating polarization channels, textural parameters, and spectral data often leads to reliable AGB estimates in different Indian forest ecosystems. However, the exact combinations used in those studies were not adopted in this analysis. The predictor variables used for the study are listed in Table 3. A detailed list of the predictor variables is available in the Supplementary File.
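The four-values-per-variable scheme can be sketched as follows. This minimal Python example extracts the centre pixel and the minimum, maximum, and mean of its neighbourhood from a raster held as a list of rows. The 3 × 3 neighbourhood size, the clipping at raster edges, and the function name `neighbourhood_features` are assumptions made for illustration, since the text does not specify them.

```python
def neighbourhood_features(raster, row, col):
    """Return the centre value plus min, max, and mean of its neighbourhood.

    ASSUMPTION: a 3 x 3 window centred on (row, col), clipped at the raster
    edge. `raster` is a list of equal-length rows of numbers.
    """
    rows = range(max(row - 1, 0), min(row + 2, len(raster)))
    cols = range(max(col - 1, 0), min(col + 2, len(raster[0])))
    window = [raster[r][c] for r in rows for c in cols]
    return {
        "centre": raster[row][col],
        "min": min(window),
        "max": max(window),
        "mean": sum(window) / len(window),
    }


if __name__ == "__main__":
    # Hypothetical 3 x 3 patch of one predictor layer around a plot centre.
    layer = [
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0],
    ]
    print(neighbourhood_features(layer, 1, 1))
```

With four values per layer, the reported total of 112 predictors corresponds to 28 base layers (112 / 4).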

Table 3 Parameters and variables for each model

Methods and Modelling

The workflow diagram (Fig. 3) provides a visual representation of the AGB estimation process and implementation of the ML models outlined in this study.

Fig. 3 Methodology for estimation of AGB from ML models

The procedure consists of the following phases:

  • Pre-processing the satellite images and deriving vital predictor variables

  • Training the selected ML models in distinct modeling scenarios

  • Evaluating the efficacy of the models against a test dataset

  • Generating the AGB map based on the best-performing model

This approach involved integrating data from Sentinel-1 and Sentinel-2 with terrain attributes from SRTM data and the canopy height product. The models’ performance was then checked against ground truth data. Three ML algorithms, RF, XGB, and BRT, were implemented to predict AGB in the different modeling scenarios. Custom Python 3 scripts were used for both the modeling and validation processes.

Random Forest Model

RF is an ensemble-learning algorithm that uses a large collection of decision trees for both regression and classification tasks. Decision trees, a widely recognized approach in machine learning, apply a sequence of conditions on the input variables, progressing from the tree’s root to its leaves (Quinlan, 2014). When a regression tree is built, each node is split in two using a subset of the input variables. Both the number of regression trees and the number of input variables considered at each node need to be tuned. Predictions are then obtained by averaging the outputs of all trees. The underlying principle of RF is to strengthen the variance reduction achieved by averaging by minimizing the correlation among trees (Hastie et al., 2009). To achieve this, the input variables considered at each split are chosen at random during tree development.
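The variance argument can be made explicit with a standard result for averaged estimators, given here as a sketch. The symbols are introduced for illustration only: B is the number of trees, T_b(x) is the prediction of tree b, σ² is the variance of a single tree, and ρ is the pairwise correlation between trees, all assumed identically distributed.

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}$$

Adding trees shrinks only the second term, so the remaining variance is governed by ρ. Random selection of input variables at each split lowers ρ and therefore the variance of the ensemble.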

Extreme Gradient Boosting Model

XGB (Chen et al., 2016) is an advanced ML algorithm that has garnered widespread recognition for its superior performance in Kaggle competitions. This model, which is an optimized version of gradient-boosted regression trees, is tailored for enhanced speed and efficiency. It leverages the second-order derivative of the loss function to hasten convergence and incorporates a regularization component to mitigate the risk of overfitting. As a result, XGB stands out as a versatile and scalable solution, especially adept at managing sparse datasets and achieving rapid convergence.
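The role of the second-order derivative and the regularization term can be sketched with the commonly stated form of the objective minimized when the t-th tree f_t is added. The notation is introduced here for illustration: g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction for sample i, T is the number of leaves, w_j is the weight of leaf j, and γ and λ are regularization parameters.

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2}\,h_i f_t(x_i)^{2}\right] + \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}$$

For a fixed tree structure, the weight that minimizes this expression in leaf j is

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

where G_j and H_j are the sums of g_i and h_i over the samples falling in leaf j. Larger values of λ shrink the leaf weights, which is one way the regularization term limits overfitting.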

Boosted Regression Tree Model

The BRT model merges the principles of boosting with the decision tree algorithm to enhance predictive performance. In BRT, the risk of overfitting is reduced by fitting each new tree to a random subset of the training data. Unlike the RF model, which applies bagging, BRTs employ a boosting approach, assigning varying weights to the input data for each successive tree (Biodiversity & Climate Change Virtual Laboratory, 2021). This ensures that data points that were poorly predicted by earlier trees have a greater likelihood of influencing the formation of subsequent trees. Such a strategy increases the model’s precision by allowing each new tree to correct the errors of the previous ones.
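The idea of each tree correcting its predecessors can be sketched in a few lines of Python. This toy example is not the implementation used in the study: it uses one predictor, single-split trees (stumps), squared error, and the residual-fitting form of boosting rather than explicit reweighting of data points. All function names and default parameter values are assumptions for illustration.

```python
import random


def fit_stump(xs, residuals):
    """Find the single threshold on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: predict the mean residual everywhere.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_trees=200, learning_rate=0.1,
                       subsample=0.8, seed=0):
    """Fit stumps sequentially to the residuals of the current ensemble.

    Each stump sees a random subset of the data and its contribution is
    shrunk by the learning rate. Expects at least two samples.
    """
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    n_sub = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), n_sub)
        residuals = [ys[i] - preds[i] for i in idx]
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


if __name__ == "__main__":
    # Hypothetical predictor values and AGB-like responses.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0]
    model = fit_boosted_stumps(xs, ys)
    print(predict(model, 2.0), predict(model, 7.0))
```

A smaller learning rate makes each tree's correction smaller, so more trees are needed to reach the same fit.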

Tuning Process of ML Models

To identify the optimal settings, a series of tests with various tuning parameter values was performed. For RF, accuracy increased with the addition of trees until it levelled off at an ‘ntree’ setting of 500. The impact of the ‘mtry’ parameter was more pronounced with fewer trees and diminished as the number of trees grew. Optimal performance for Model 1 was achieved with an ‘mtry’ of 10, where R2 increased slightly and RMSE remained stable as the tree count increased. Model 2 exhibited a more intricate R2 trend, but the RMSE trend was clearer and favoured an ‘mtry’ of 5. The optimal ‘mtry’ for Model 3 was 3. For Model 4, an ‘mtry’ of 10 delivered the best outcomes, with R2 and RMSE settling into a consistent range. XGB models showed little sensitivity to gamma but required the child weight to be balanced as tree depth increased in order to maintain accuracy. Lower learning rates were beneficial: they prevented overfitting, although they required more iterations to reach the same accuracy, and the optimal rate was 0.01. The optimal subsample rates were below the default, at 0.5, 0.7, 0.6, and 0.8 for Model 1, Model 2, Model 3, and Model 4, respectively. Model performance was also better at lower ‘nrounds’ values and remained stable as boosting iterations increased. Choosing the right ‘nrounds’ was critical and differed from the choice of ‘ntree’ for RF. In tuning the BRT model, the learning rates ranged from 0.001 to 0.03, specifically 0.009 for Model 1, 0.001 for Model 2, 0.005 for Model 3, and 0.03 for Model 4, and ‘ntree’ was set to an optimal value of 500. This systematic parameter adjustment yielded the tuned ML models used in the analysis.

Model Validation and AGB Estimation

To estimate AGB, four distinct modeling scenarios were evaluated, each employing different sets of variables. The field dataset was divided randomly, with 80% for model training and the remaining 20% for validation. To determine the most effective model for each variable combination, a five-fold cross-validation approach (Kuhn & Johnson, 2013) was employed on the training dataset. The coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) were analyzed and compared across these models to identify and select the most effective model for mapping AGB. The AGB map was generated using the most accurately fitted model, with a spatial resolution of 30 m across the study area.
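The validation scheme can be sketched with the Python standard library alone. The sketch below draws an 80/20 random split of plot indices, builds five cross-validation folds from the training portion, and defines the three metrics. The function names, the rounding used for the split, and the definition of R2 as one minus the ratio of residual to total sum of squares are assumptions for illustration; the study's scripts may differ, for example by defining R2 as a squared correlation.

```python
import math
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, folds[i]


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (here t/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error, in the units of the inputs (here t/ha)."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    # 106 field plots, as in the study; the seed is arbitrary.
    train, test = train_test_split(106)
    print(len(train), "training plots,", len(test), "validation plots")
    for training, held_out in k_fold_indices(train, k=5):
        print(len(training), "fit /", len(held_out), "held out")
```

Under these rounding assumptions, 106 plots give 85 training and 21 validation plots, and each cross-validation fold holds out 17 training plots.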

Results

This section presents the findings of the study, which estimated AGB by applying multiple ML models to satellite data. The outcomes of the RF, XGB, and BRT algorithms under the different dataset combinations were analyzed to assess the effectiveness and accuracy of each in estimating AGB. Figure 4 presents a categorical analysis of the importance of predictor variables. The spectral variables were identified as the most significant, while texture and polarization variables also exhibited substantial importance. Physical variables were found to be the least important in this analysis, as indicated by their lower median value and the presence of outliers in the data.

Fig. 4 Importance of predictor variables (category wise)

Predictive Modeling of AGB

Figure 5 presents the validation results of the predicted AGB against the observed values for Model 1, using the selected ML algorithms. For the RF model, a moderate correlation is observed with an R2 value of 0.52, an RMSE value of 42.25 t/ha, and a MAE value of 35.89 t/ha. The XGB model exhibits an R2 value of 0.51, an RMSE value of 41.64 t/ha, and a MAE value of 35.44 t/ha. Lastly, the BRT model presents comparable results with an R2 value of 0.47, an RMSE value of 43.02 t/ha, and a MAE value of 37.47 t/ha.

Fig. 5 Validation plots of predicted AGB for (a) RF, (b) XGB, and (c) BRT obtained from Model 1

Figure 6 illustrates the validation results for the prediction of AGB in Model 2, employing various ML algorithms. For the RF model, the outcomes indicate a moderate correlation with an R2 value of 0.46, an RMSE value of 42.56 t/ha, and a MAE value of 37.07 t/ha. In the case of the XGB model, the results manifest a correlation with an R2 value of 0.51, an RMSE value of 42.56 t/ha, and a MAE value of 37.70 t/ha. Conversely, the BRT model demonstrated results with an R2 value of 0.44, an RMSE value equal to 40.55 t/ha, and a MAE value of 34.76 t/ha.

Fig. 6 Validation plots of predicted AGB for (a) RF, (b) XGB, and (c) BRT obtained from Model 2

Figure 7 shows the validation results of predicted AGB against observed values for Model 3 for the selected ML models. For the RF model, a moderate correlation is observed, with an R2 value of 0.39, an RMSE value of 44.99 t/ha, and a MAE value of 37.40 t/ha. The XGB model performed slightly better, with an R2 value of 0.44, an RMSE value of 42.37 t/ha, and a MAE value of 35.71 t/ha. The BRT model showed an R2 value of 0.38, an RMSE value of 45.61 t/ha, and a MAE value of 36.52 t/ha.

Fig. 7 Validation plots of predicted AGB for (a) RF, (b) XGB, and (c) BRT obtained from Model 3

Figure 8 showcases the validation results of AGB prediction in Model 4, utilizing the selected ML models. The RF model displays a modest correlation with an R2 value of 0.49, an RMSE value of 41.11 t/ha, and a MAE value of 35.27 t/ha. On the other hand, the XGB model results indicate a strong performance with an R2 value of 0.61, coupled with an RMSE value of 37.85 t/ha and a MAE value of 32.47 t/ha. The BRT model unveils outcomes with an R2 value of 0.41, an RMSE value of 41.81 t/ha, and a MAE value of 35.52 t/ha.

Fig. 8 Validation plots of predicted AGB for (a) RF, (b) XGB, and (c) BRT obtained from Model 4

The ML algorithms applied in this study (RF, XGB, and BRT) exhibited a range of performances in predicting AGB across the four modelling scenarios. The RF algorithm showed variable performance but produced reasonable outcomes in some scenarios. The XGB algorithm consistently demonstrated moderate to strong correlations across all scenarios. The BRT algorithm also varied in performance yet yielded satisfactory results in some scenarios. XGB showed its strongest performance in Model 4, yielding the highest R2 value. RF performed best in Model 4, where it showed a relatively lower error in estimating AGB. BRT performed best in Model 2, where it showed a comparatively lower estimation error for AGB.

Figure 9 presents a comparison of the R2, RMSE, and MAE across the different models. The diversity in model performances underscores the importance of selecting an appropriate ML algorithm tailored to the specific characteristics and requirements of each dataset and model to enhance the accuracy and reliability of AGB predictions.

Fig. 9 Performance metrics of the four models for AGB prediction using RF, XGB, and BRT: (a) coefficient of determination (R2), (b) root mean square error (RMSE) in t/ha, and (c) mean absolute error (MAE) in t/ha

Spatial Mapping of AGB

The distribution map of AGB, depicted in Fig. 10, was produced at a 30 m spatial resolution derived using the XGB algorithm leveraging Model 4 variables. The XGB model incorporating Model 4 demonstrated the highest R2 and the lowest RMSE in comparison to other models. The mean AGB recorded in the field is 94.83 t/ha, while the mean for the predicted AGB is 41.45 t/ha. The predicted AGB within the study area spans from a minimum of 23.43 t/ha to a maximum of 176.61 t/ha. The AGB map utilizes a gradient colour scheme that transitions from yellow to a dark green, representing a range of AGB values from 23.43 to 176.61 t/ha. It depicts AGB density categorized into four distinct ranges, each represented by a colour on the legend. Most of the mapped area is dominated by AGB values in the range of 50–100 t/ha, as indicated by the prevalence of the lime colour. Following this, the next most extensive category is the 100–150 t/ha range, represented by olive shade, which corresponds to regions with relatively higher AGB. The map also shows substantial areas within the 23.43–50 t/ha category, highlighted in yellow, implying regions with a lower biomass density. The darkest green pixels on the map represent the areas with the highest AGB, ranging from 150 to 176.61 t/ha.

Fig. 10 AGB map predicted using the best-fitted model

Discussions

The study’s findings indicate that the ML models (RF, XGB, and BRT) varied in their performance in predicting AGB from the selected satellite data, with the XGB model generally showing moderate to strong correlations. The strongest performance was observed in Model 4 using the XGB algorithm, which achieved the highest R2 and lowest RMSE values, indicating its superior accuracy in AGB estimation. The spatial distribution of AGB was mapped at a 30 m resolution, with the majority of the area displaying AGB values in the range of 50–100 t/ha. This illustrates the capability of ML methods in estimating AGB (Dube & Mutanga, 2015). Non-parametric models excel in managing the non-linear relationships between forest AGB and satellite data (Liu et al., 2017). Furthermore, the ability of ML algorithms to handle non-linearity and assess the significance of predictor variables underscores their effectiveness (Pandit et al., 2018). This research applied an allometric equation originally proposed by Chave et al. (2005) to estimate the AGB. This method takes into account both the DBH and the height of trees within the sample plots. Lambert et al. (2005) found that adding tree height to allometric equations, alongside DBH, improves the accuracy of tree volume estimates and decreases the root mean squared error in predictions of total tree biomass. Furthermore, Frank et al. (2018) highlighted the importance of including tree height in models to better reflect variations across different locations.

Relationship Between Satellite Data and AGB

The integration of optical and SAR data marks an advancement in forest AGB estimation over the use of either data source in isolation. While optical imagery provides detailed information on the horizontal layout of forests, its penetrative capacity is limited, primarily capturing surface features rather than the full vertical profile (Myneni et al., 2001). SAR data, particularly at longer wavelengths such as L-band and P-band, can pierce through the canopy to reveal the crucial vertical structure indicative of AGB, which is predominantly composed of stem and branch biomass. The synergistic use of both optical and SAR data leverages the strengths of each. This combined approach, therefore, holds significant promise for enhancing the accuracy and reliability of AGB measurements. This study selected Sentinel-1 SAR data at C-band, because it was readily available for the geographic location of the study. The study examined the VV and VH polarization channels as the predictor variables of the SAR data. The accuracy of AGB estimation by SAR can be compromised by the terrain and can suffer from signal saturation in very dense or high-biomass areas (Imhoff, 1993; Le Toan et al., 1992; Luckman et al., 1997). It has been documented that C-band SAR backscatter typically reaches saturation at AGB levels ranging from 30 to 50 t/ha (Lucas et al., 2015). In the case of optical data, NDVI and EVI are commonly utilized vegetation indices, yet in this study, NDRE1 and GNDVI were found to be superior in estimating AGB in the correlation analysis. This aligns with Wang et al. (2007), who found GNDVI more precise than NDVI in LAI estimation across various conditions. Likewise, these findings are consistent with the research conducted by Otsu et al. (2019), who reported the superior performance of GNDVI in differentiating between broadleaf and needleleaf forests compared to NDVI. 
Supporting this, Yoder and Waring (1994) found the green spectral band to be more strongly correlated with photosynthetic activity in the tree canopies of miniature Douglas-firs than the red spectral band. The difference in efficacy between NDVI and GNDVI can be attributed to NDVI being more sensitive to lower chlorophyll concentrations, while GNDVI is more effective at detecting higher chlorophyll levels, thereby providing greater accuracy in assessing chlorophyll concentration in tree crowns (Gitelson et al., 1996). In this study, NDRE1 also emerged as a superior predictor for estimating AGB, primarily because of its sensitivity to chlorophyll content. The sensitivity of the red-edge bands is particularly crucial, as the reflectance in these bands is influenced by the thickness of the tree canopy layers. Research by Horler et al. (1983) and Eitel et al. (2011) has shown that the red-edge spectral band is particularly adept at estimating AGB in areas of dense canopy coverage, providing a more accurate measurement than traditional vegetation indices through its ability to detect chlorophyll absorption and reflection in leaves. This finding is supported by Mutanga et al. (2012) and Laurin et al. (2018), who also reported a relationship between the reflectance of red-edge bands and factors such as canopy density and biomass. Since NDRE1 effectively captures variations in these red-edge bands, it serves as a more accurate indicator of the chlorophyll content and, by extension, the overall health and biomass of the canopy. This sensitivity makes NDRE1 particularly effective in environments with dense vegetation, where traditional indices like NDVI might be less responsive because of saturation. NDRE1’s ability to detect subtle changes in chlorophyll content in these dense canopy layers provides a more nuanced estimation of biomass, distinguishing it from other vegetation indices and explaining its superior performance in the study.
In agreement with previous studies (Ali et al., 2015; Ghosh & Behera, 2018; Liu et al., 2019; Sinha et al., 2015), this study also demonstrated that integrating SAR parameters with optical parameters (particularly the red-edge (B5) spectral band) and terrain parameters in ML models raises the saturation threshold for biomass density estimation.
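The indices compared above are all normalized differences of two band reflectances. The sketch below illustrates how they could be computed for a single pixel. The band mapping and the NDRE1 formulation (Sentinel-2 B6 against B5) are assumptions based on common usage, not definitions taken from this study, and the reflectance values are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def vegetation_indices(green, red, red_edge1, red_edge2, nir):
    """Compute three indices from surface reflectances on a 0-1 scale.

    Assumed Sentinel-2 band mapping: green=B3, red=B4, red_edge1=B5,
    red_edge2=B6, nir=B8. The NDRE1 form used here (B6 vs. B5) is an
    assumption; the study may define the index differently.
    """
    return {
        "NDVI": normalized_difference(nir, red),
        "GNDVI": normalized_difference(nir, green),
        "NDRE1": normalized_difference(red_edge2, red_edge1),
    }


# Hypothetical reflectances for a dense-canopy pixel (illustrative only).
pixel = vegetation_indices(
    green=0.05, red=0.03, red_edge1=0.10, red_edge2=0.30, nir=0.45
)
```

In this illustrative pixel NDVI sits closer to its upper bound of 1 than GNDVI or NDRE1, which is the kind of headroom difference that the saturation argument above relies on.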

Efficacy of Machine Learning Approaches in AGB Estimation

Earlier studies on biomass estimation predominantly employed standard statistical regression techniques, such as linear regression, which assume a direct linear relationship between independent and dependent variables (Dong et al., 2003; Le Toan et al., 1992). However, these classical methods do not adequately capture the complexity of the relationship between AGB and satellite data. Advanced ML approaches, like RF and XGB, are adept at delineating the intricate non-linear relationships present within heterogeneous data distributions and at integrating diverse data sources to enhance the accuracy of biomass estimations. Many previous studies reported that combining ML algorithms with multi-sensor RS data helps prevent overfitting and significantly enhances estimation accuracy. For instance, Behera et al. (2023) used a combination of 71 spectral and texture variables derived from Sentinel-2 in an RF model to estimate AGB in the regional landscape of the Eastern Ghats. David et al. (2022) combined Sentinel-1 SAR and Sentinel-2 multispectral imagery in an RF model to assess the AGB of dryland forests of Southern Africa. Singh et al. (2022) compared the efficacy of RF and Artificial Neural Network (ANN) models for estimating the AGB of dry deciduous forests using Sentinel-2 data from different seasons. Ghosh and Behera (2018) used RF and stochastic gradient boosting to assess the AGB of dense tropical forests with 70 predictor variables derived from Sentinel-1 and Sentinel-2 data. Similarly, the present study incorporated 112 predictor variables from Sentinel-1 and Sentinel-2 data, along with variables derived from elevation data and the height product (Supplementary File).
Among the three modelling approaches analysed in this study, XGB achieved the best results, exhibiting the highest R2 and the lowest RMSE, outperforming both the RF and BRT models. The superior performance of XGB in this study can be primarily attributed to its inherent algorithmic strengths. XGB represents an enhanced gradient boosting framework known for its flexibility and ability to adjust residuals in the process of developing new trees from existing ones, unlike the RF model where trees are constructed independently (Chen & Guestrin, 2016; Friedman, 2002). XGB represents a more refined version of gradient boosting systems, excelling in processing a regularized learning objective, a feature instrumental in mitigating overfitting (Chen & Guestrin, 2016). However, it’s important to note that challenges like overestimation and underestimation, a common issue in ML algorithms for AGB estimation, were not entirely resolved (Stelmaszczuk-Górska et al., 2015). A key limitation of the decision trees, fundamental to both RF and XGB methods, is their inability to extrapolate beyond the data present in the training set. Moreover, when employing remote sensing datasets for biomass estimation, issues of data saturation can arise (Mutanga & Skidmore, 2004). Additionally, the limited number of plots used in this study restricted the opportunity for a more stratified estimation approach, which might be based on different biomass levels or forest types. Such an approach could potentially reduce estimation errors further. Li et al. (2021) observed that XGB surpassed RF in performance, and another comparison by Li et al. (2020) revealed that XGB excelled beyond both RF and linear regression. The findings of this study are also in concordance with the research done by Zhang et al. (2021) and Luo et al. (2022), which have shown that XGB tends to surpass RF in the performance of regression models. 
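The extrapolation limitation noted above can be illustrated with a toy example. The sketch below fits a single depth-1 regression tree (a stump) in plain Python; it is not the RF or XGB configuration used in this study, and the predictor and AGB values are hypothetical. Because each leaf predicts the mean of its training targets, a predictor value far outside the training range receives the same prediction as the largest training value.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if the data are split between i-1 and i.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


# Hypothetical training data: a predictor value and plot AGB in t/ha.
predictor = [1, 2, 3, 4, 5, 6]
agb = [20, 40, 60, 80, 100, 120]

predict = fit_stump(predictor, agb)
inside_range = predict(6)    # largest predictor value seen in training
outside_range = predict(10)  # well beyond the training range
```

Here both calls return the mean of the upper leaf, so the prediction never exceeds the largest AGB value in the training plots. Ensembles of deeper trees behave in the same piecewise-constant way, which is one reason high-biomass stands tend to be underestimated when they are poorly represented in the field data.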
The RF algorithm demonstrated greater ease of calibration and resilience against overfitting compared to BRT, an advantage linked to the bagging technique, which lessens the prediction model’s variance. This aligns with the literature indicating superior performance of the RF model over BRT (Wang et al., 2018).
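The model comparison above rests on R2 and RMSE. As a minimal sketch of how these two metrics can be computed from observed and predicted plot values, the example below treats R2 as the coefficient of determination. This is an assumption, since some studies report the squared Pearson correlation instead, and the AGB values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (e.g. t/ha)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical field-measured and model-predicted AGB values (t/ha).
observed_agb = [50, 100, 150, 200]
predicted_agb = [60, 90, 160, 190]

model_rmse = rmse(observed_agb, predicted_agb)
model_r2 = r_squared(observed_agb, predicted_agb)
```

Computing both metrics on held-out plots rather than on the training plots gives a fairer comparison between RF, BRT, and XGB, since boosted models in particular can fit the training data very closely.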

Multi-Sensor Earth Observation Studies in Indian Forests

Several studies have used remote sensing to investigate the biomass of Indian forests, drawing on optical, SAR, and LiDAR data either singly or in combination. Reddy et al. (2016) explored the spatial distribution of biomass carbon density in Indian forests from 1930 to 2013 using satellite remote sensing data, historical archives, and collateral data. The study estimated a total aboveground carbon stock of 3070.27 Tg C in 2013, with notable variations across periods. Ghosh and Behera (2018) investigated AGB estimation in dense tropical forests using multi-sensor data from the Sentinel-1A and Sentinel-2A satellites, combined with machine learning algorithms such as RF and stochastic gradient boosting. Their research, focused on Shorea robusta and Tectona grandis in Katerniaghat Wildlife Sanctuary, Uttar Pradesh, demonstrated that integrating SAR data, texture images, and vegetation indices improves AGB estimation accuracy, highlighting the potential of Sentinel satellite data and machine learning in forest biomass assessments. Singh et al. (2022) applied open-source satellite data and ML techniques to monitor AGB at finer scales in the dry deciduous tropical forest of Tundi Reserved Forest, Jharkhand. Their case study highlighted the superior performance of RF and ANN models using wet season Sentinel-2 data, while dry season data proved challenging for AGB estimation, underscoring the potential of the methodology for forest carbon stock monitoring. Bhandari and Nandy (2023) used terrestrial laser scanning (TLS) together with satellite-derived forest canopy density (FCD) and spectral indices to predict AGB in the Barkot Reserve Forest in Uttarakhand, demonstrating a strong correlation between TLS measurements and field data.
Their approach, combining TLS data with FCD classifications from Landsat-8 OLI, proved effective in estimating the study area's AGB with high precision. Another study in the Barkot Reserve Forest, by Singh et al. (2023), integrated TLS and ALOS PALSAR L-band SAR data for AGB estimation using machine learning algorithms. The research combined various SAR-derived parameters with TLS measurements of tree dimensions and found that the RF algorithm outperformed the ANN in AGB prediction, demonstrating the potential of SAR and LiDAR data fusion in forest biomass assessments. Behera et al. (2023) estimated AGB across a regional forest landscape by integrating textural and spectral variables from Sentinel-2 with ancillary data, effectively overcoming the saturation effects of optical remote sensing. Using an RF model, the study achieved a significant correlation in AGB variability, demonstrating the potential of this integrated approach for improving AGB mapping accuracy and for developing generalized AGB models. Sainuddin et al. (2023a) investigated the use of multifrequency SAR data for estimating AGB in the tropical forests of the Western Ghats region of Kerala by applying a scattering model based on vector radiative transfer (VRT) theory. The study used dual-pol SAR data from L-band ALOS-2, S-band NovaSAR, and C-band Sentinel-1 to retrieve biophysical parameters such as tree height and trunk radius, which were then used to estimate AGB with a general allometric equation. Validation with ground truth data showed that the L-band data provided the most accurate AGB estimates, demonstrating its superior potential for biomass estimation over S- and C-band data. Ayushi et al. (2024) addressed the complexity of estimating AGB in tropical biodiversity hotspots by employing seven machine learning algorithms to analyse multisource datasets, including Sentinel-1 and Sentinel-2 data, topography, soil, and climate.
Their findings highlight the effectiveness of an ensemble stacking approach, which integrates these diverse datasets for AGB prediction, showcasing high accuracy and the importance of environmental variables in enhancing estimation precision.

Conclusion

This research integrated satellite SAR and multispectral imagery with physical parameters to map AGB across the deciduous forests of the Purna regional landscape in the Western Ghats. The findings indicate that AGB estimation accuracy can be enhanced through the synergy of SAR and multispectral data. By applying and comparing the RF, XGB, and BRT models, the study revealed their respective advantages when used with satellite data. The models handled the complex, non-linear relationships between the satellite-derived variables and AGB, with XGB consistently surpassing RF and BRT in accuracy. Model 4, which used XGB, was the most precise, with the highest R2 of 0.61 and the lowest RMSE of 37.85 t/ha. The spatial analysis at a 30 m resolution highlighted the distribution of AGB across the landscape, showing that ML methods can capture the gradation of biomass densities from low to high AGB ranges. The study demonstrates that the fusion of freely accessible SAR and multispectral data (from Sentinel-1 and Sentinel-2) can enhance the accuracy of AGB estimation. SAR backscatter data, when combined with selected optical band data, particularly from red-edge wavelengths, markedly improved the estimation and mitigated the saturation usually seen in high-biomass areas. Indices such as NDRE1 and GNDVI exhibited stronger linear correlations with AGB than traditional indices such as NDVI, GRVI, and EVI. The precision and timeliness provided by these methods are vital for a deeper understanding of tropical forest ecosystems and for the effective management of forest resources within protected areas.
Moving forward, using new technologies and methods could make the estimation of AGB even more accurate. Advancements in sensor technology, including the arrival of higher-resolution satellite imagery, promise to provide data with greater detail, facilitating a more accurate analysis of AGB. The next generation of sensors, including LiDAR profilers like ICESat-2, GEDI, and MOLI, along with SAR sensors such as NISAR, BIOMASS, and ALOS-2, are poised to deliver unparalleled precision and resolution in AGB measurements. Exploring the potential of convolutional neural networks and other deep learning frameworks might reveal patterns and correlations in environmental data that are currently underutilized. As the accuracy of AGB estimation continues to improve, these methodologies hold great promise for better informed and more effective environmental policy and resource management decisions.