Introduction

Soybean [Glycine max (L.) Merr.] plays a strategic role in food and energy security, being one of the most important legume species cultivated worldwide (~ 120.5 million ha) (FAOSTAT, 2021). Brazil is the largest producer of this crop, with approximately 137.2 million tons of grain harvested on 38.9 million hectares during the last growing season (i.e. 2020/21) (Conab, 2021). The average soybean yield in Brazil is around 3000 kg ha–1, but owing to the advanced technology and optimum management practices used on some farms, farmers can reach yields greater than 10,000 kg ha–1 under commercial conditions (Battisti et al., 2018).

During the last decades, many efforts have been made to better understand the geospatial and temporal variability of crop yields at large scales (regional or national). A snapshot of past and current agronomic and climatic scenarios is essential for resource allocation, efficient market strategies, and socioeconomic policies aimed at closing gaps in agricultural production systems. Crop yield, i.e. the production (e.g. soybean grains, sugarcane stems and grassland biomass) per unit of land area (e.g. hectare), is one of the most widely used metrics to indicate the level of agricultural development of a particular region (Lobell et al., 2009). However, its estimation at large scales is one of the major challenges that policy-makers and governmental agencies face when drawing up efficient agricultural strategies (van Klompenburg et al., 2020). Uncertainties associated with the uneven distribution of yield data collected from farmers’ surveys, the spatial variability of soil, relief and weather even at small scales, the heterogeneity of inputs and genotypes used to achieve those yields, and the need to account for gradual changes in the latter over time (plant breeding, technology adoption, policy changes) still pose considerable challenges (Hampf et al., 2020; van Bussel et al., 2015).

Agricultural models are powerful tools for assessing the effect of different environmental and management conditions on crop yields. Their use has become popular in recent decades, following pronounced advances in technology that substantially increased access to data resources and computational processing (Jones et al., 2017). Moreover, these tools are fundamental for identifying opportunities to enhance global food production, mitigate greenhouse gas (GHG) emissions, and reduce food insecurity in a sustainable way (Cassman & Grassini, 2020; Ewert et al., 2015).

Process-based crop simulation models have been developed and tested to better understand the relationships between crop growth, environmental conditions and management practices (Jones et al., 2017; Nendel et al., 2014). They are particularly useful for evaluating the impact of environmental conditions or management strategies on multiple target variables, and their trade-offs, simultaneously. Process-based models are often developed under experimental field conditions and require detailed information to run the simulations (e.g. weather, soil and management), which is still rarely available in many agricultural regions (Ramirez-Villegas & Challinor, 2012). Furthermore, appropriate calibration of these models still poses a major challenge (Wallach et al., 2021).

On the other hand, data-driven models (machine learning algorithms or statistical models) have also been widely used in recent years owing to their flexibility regarding the inputs required. This group of models is often used to investigate the relationship between a target variable (e.g. crop yield) and a set of explanatory variables (e.g. crop, weather, soil, management and vegetation indices) (Kang et al., 2020; Webber et al., 2020). Although data-driven models have some limitations due to their intrinsic characteristics (e.g. they do not explain a particular crop growth process), they present some advantages compared to process-based models. For example, data-driven models are flexible regarding their inputs, i.e. they do not require a previously established set of inputs (daily weather records, detailed soil and management information) as needed by modelling platforms like DSSAT (Jones et al., 2003), APSIM (Holzworth et al., 2018) or MONICA (Nendel et al., 2011). Another advantage lies in the possibility of estimating yields with daily, monthly or even yearly weather records. Additionally, categorical variables such as soil type and level of management (low, mid and high) can be included in the set of explanatory variables. Thus, given the lack of detailed inputs for running process-based crop models (e.g. cultivar choice, sowing dates, planting density, fertilization rates, etc.) at large geospatial scales, data-driven models have emerged and been tested as a valuable alternative for yield estimation at regional and national scales (Jiang et al., 2020; Lobell & Burke, 2010; Schwalbert et al., 2020).

In Brazil, several studies have investigated aspects associated with the sustainability (Sentelhas et al., 2015), impact of climate change (da Silva et al., 2021) and impact of management practices (Nóia Júnior & Sentelhas, 2019) of the soybean crop, often through process-based modelling approaches at point scale and field experiments. On the other hand, fewer studies have assessed the impact of climate change and advances in agricultural technology on soybean cropping systems (Hampf et al., 2020), as well as the effect of economic and operational costs on soybean yields (Vera-Diaz et al., 2008), using process- and regression-based models, respectively. In addition, hybrid methods have merged remote sensing products with agrometeorological models to estimate soybean yields at the regional scale (De Melo et al., 2008; Silva Fuzzo et al., 2020), becoming a potential alternative for mapping yields at a fairly low cost.

The use of publicly available national databases for large-scale assessments is not common. A potential source of long-term agricultural information (including crop yield) is the Brazilian Institute of Geography and Statistics, through the Survey of Agricultural Production website (IBGE/SIDRA, https://sidra.ibge.gov.br/). There, a range of crop production information (e.g. crop yield, harvested area, total production) is spatially aggregated from municipality to national scale and can be accessed for long periods.

In this study, we hypothesize that data-driven models are suitable tools for both early prediction and end-of-season estimation of soybean yields at large scales based on publicly available agro-climatic information. To test our hypothesis, we investigated the performance of data-driven models fed with publicly available databases of soybean yields and agro-climatic data at county scale over 23 years (1996–2018). We calibrated and validated data-driven models to estimate soybean yields and then made predictions using an independent dataset. Thus, the objectives of this study were: (i) to evaluate the robustness of data-driven models for early prediction of soybean yields at 30, 60 and 90 days after sowing (DAS), and to compare it with end-of-season (120 DAS) yield estimation in the main producing regions in Brazil; (ii) to investigate the suitability of the “best” data-driven model as a tool to predict soybean yield for an independent year, based on publicly available databases of crop yield, weather and soil.

Material and Methods

Soybean Yield Database

A publicly available database containing county-scale soybean yield records (in kg ha–1) for 23 growing seasons (1996–2018) was downloaded from the Brazilian Institute of Geography and Statistics (IBGE). The raw dataset initially had records from 558 counties, and it was subjected to quality control to identify and remove suspicious and unrealistic yield records. The following steps were applied to the raw crop yield dataset: (i) counties with at least one missing year were removed; and (ii) counties with identical yield records in consecutive years were also removed, since such repetition is unlikely given the year-to-year variability of meteorological conditions throughout the crop cycle.
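
The two quality-control rules can be illustrated with a minimal sketch. The function name, the dictionary layout and the county records below are hypothetical and only serve to show the filtering logic; they are not the procedure or data actually used with the IBGE files.

```python
def select_high_quality(yields_by_county):
    """Keep counties with no missing year and no repeated consecutive yields.

    yields_by_county maps a county id to a list of yearly yields (kg/ha),
    ordered by year, with None marking a missing record (assumed layout).
    """
    kept = {}
    for county, series in yields_by_county.items():
        # Rule (i): drop counties with at least one missing year.
        if any(value is None for value in series):
            continue
        # Rule (ii): drop counties with identical yields in consecutive years.
        if any(a == b for a, b in zip(series, series[1:])):
            continue
        kept[county] = series
    return kept


# Hypothetical records for three counties over four seasons.
records = {
    "county_a": [2400, 2550, 2310, 2800],
    "county_b": [2400, None, 2310, 2800],  # one missing year
    "county_c": [2400, 2400, 2310, 2800],  # repeated consecutive value
}
high_quality = select_high_quality(records)
print(sorted(high_quality))
```

Only the first hypothetical county passes both rules, which mirrors how the raw set of counties is reduced to a smaller "high-quality" subset.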

Weather Database

Monthly weather data containing records of maximum and minimum air temperature (Tmax and Tmin, respectively, °C) and precipitation (Prec, mm) were downloaded from the gridded weather database WorldClim (https://www.worldclim.org) for the period 1995–2018. WorldClim is a monthly time-step product, downscaled from CRU-TS-4.03. The WorldClim data records are stored as GeoTiff files for the years 1960–2018, covering the whole globe at ~ 5-km spatial resolution. We cropped the Tmax, Tmin and Prec layers spatially (selected counties) and temporally (November to February, for the 23 years of analysis) using Quantum-GIS software.

Soil Database

Soil information was taken from the SoilGrids database (https://www.isric.org/), a widely used soil information database for agro-ecological modelling studies. Soil characteristics available in SoilGrids were generated with machine learning approaches developed from circa 150,000 soil profiles around the world, of which around 5,000 are located in Brazil (Cooper et al., 2005). SoilGrids raster files cover the whole world at 250-m spatial resolution and 6 standard depths (0–5, 5–15, 15–30, 30–60, 60–100 and 100–200 cm). We downloaded only the topsoil (i.e. 0–30 cm) products related to soil texture (i.e. sand and clay content), in order to add soil characteristics to the inputs of the multivariate models. The soil information was averaged over the first 30 cm of depth (depth-weighted average) and then geospatially aggregated for each county unit.
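
As an illustration of the topsoil averaging, the sketch below computes a mean over the three upper SoilGrids layers, assuming that each layer is weighted by its thickness (5, 10 and 15 cm). The weighting scheme, the function name and the clay values are assumptions made for this example.

```python
def topsoil_mean(layer_values, layer_bounds_cm):
    """Thickness-weighted mean of a soil property over contiguous layers."""
    weighted_sum = 0.0
    total_thickness = 0.0
    for value, (top, bottom) in zip(layer_values, layer_bounds_cm):
        thickness = bottom - top  # layer thickness in cm
        weighted_sum += value * thickness
        total_thickness += thickness
    return weighted_sum / total_thickness


# The three SoilGrids standard depths that make up the 0-30 cm topsoil.
layers = [(0, 5), (5, 15), (15, 30)]
clay_percent = [30.0, 36.0, 42.0]  # hypothetical clay content per layer
topsoil_clay = topsoil_mean(clay_percent, layers)
print(topsoil_clay)  # (30*5 + 36*10 + 42*15) / 30 = 38.0
```

With thickness weights, the deepest of the three layers contributes half of the 0–30 cm value, so a simple arithmetic mean of the layers (36.0 here) would differ from the weighted one.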

Crop Cycle and Preprocessing of Explanatory Data

For simplicity, a typical soybean cycle of 120 days (sowing to harvest) was considered. Previous assessments have highlighted that soybean sown between October and November is likely to have reduced yield losses due to water deficit in Brazil (Battisti & Sentelhas, 2015; Nóia Júnior & Sentelhas, 2019). Therefore, we simulated a synthetic soybean growing period starting on 1 November (305 DOY) for all 23 growing seasons considered (1996–2018).
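
The sketch below shows which calendar months a fixed 1 November sowing date covers at 30, 60, 90 and 120 days after sowing. It assumes that a monthly layer is used whenever any day of that month falls inside the window; the function name and the sowing year are illustrative.

```python
from datetime import date, timedelta


def months_in_window(sowing_year, days_after_sowing):
    """List the (year, month) pairs touched by the first N days of the cycle."""
    sowing = date(sowing_year, 11, 1)  # fixed sowing date, 1 November
    last_day = sowing + timedelta(days=days_after_sowing - 1)
    months = []
    current = sowing
    while current <= last_day:
        key = (current.year, current.month)
        if key not in months:
            months.append(key)
        current += timedelta(days=1)
    return months


# Sowing year chosen only for illustration.
windows = {das: months_in_window(1996, das) for das in (30, 60, 90, 120)}
for das, months in windows.items():
    print(das, months)
```

Under this assumption the 120-day cycle spans November to February, which agrees with the months extracted from the weather database.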

Crop, weather and soil information were used as input data for building the models. The time series of crop yield were de-trended in order to minimize the effects of different agronomic characteristics (herein considered as technological level) that are not available at county scale for the whole of Brazil, such as sowing dates, maturity groups, water management (rainfed and irrigated cropping systems) and other factors that are difficult to obtain at large scales. Although there are several approaches in the literature to de-trend data series, we used the method described by Heinemann & Sentelhas (2011), where the yields are de-trended in relation to the last year in the time series. The last year of the time series theoretically represents the year in which farmers use the most advanced technology in their fields, whereas the other years are de-trended based on that year, following Eqs. 1, 2 and 3.

$${Y}_{regression}=a + bx$$
(1)
$${C}_{residual} = \frac{{Y}_{obs} - {Y}_{regression}}{{Y}_{regression}}$$
(2)
$${Y}_{detrended} = \left(1+{C}_{residual}\right)\times {Y}_{obs}^{n}$$
(3)

where a is the intercept and b is the slope of the linear regression; x is the id number representing each of the years (1, 2, 3, …); Yregression is the yield calculated through the linear equation (kg ha–1) (Eq. 1); Cresidual is the relative deviation between the observed (Yobs) and linear (Eq. 1) yields; Ydetrended is the yield (kg ha–1) theoretically without the effect of agronomic technology (i.e. driven only by environmental factors); and \({Y}_{obs}^{n}\) is the yield with theoretically the highest technology in the observed time series.
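
A minimal worked sketch of Eqs. 1 to 3 follows. It fits the linear trend by ordinary least squares and, as an assumption of this example, takes the trend value of the last year as the reference yield in Eq. 3; the function name and the yield series are hypothetical.

```python
def detrend_yields(observed):
    """De-trend a yearly yield series (kg/ha) relative to its last year."""
    n = len(observed)
    years = list(range(1, n + 1))  # x = 1, 2, 3, ...
    mean_x = sum(years) / n
    mean_y = sum(observed) / n
    # Ordinary least-squares slope (b) and intercept (a) of Eq. 1.
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, observed))
    b = sxy / sxx
    a = mean_y - b * mean_x
    trend = [a + b * x for x in years]  # Eq. 1
    # Assumption: the reference yield is the trend value of the last year.
    reference = trend[-1]
    detrended = []
    for y_obs, y_reg in zip(observed, trend):
        c_residual = (y_obs - y_reg) / y_reg  # Eq. 2
        detrended.append((1 + c_residual) * reference)  # Eq. 3
    return detrended, b


# Hypothetical five-year series with a low-yield third year.
observed = [2000.0, 2150.0, 1700.0, 2300.0, 2450.0]
detrended, slope = detrend_yields(observed)
print(round(slope, 1), [round(v) for v in detrended])
```

Each year keeps its relative deviation from the trend line, but that deviation is re-expressed at the technological level of the final year, so a dry year remains visibly low after de-trending.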

The explanatory variables had different orders of magnitude, and therefore we normalized them through the min–max procedure (ranging between 0 and 1) before feeding the data-driven models (Shahhosseini et al., 2021). The variables were chosen according to their widely known importance for driving crop yield and photosynthesis rates (e.g. air temperature) and soil–water availability (e.g. precipitation and soil texture) (Hatfield et al., 2001; Lobell et al., 2009; Monteith, 1977).
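
The min–max procedure can be sketched as follows. The function name and the precipitation values are illustrative, and the handling of a constant variable (all zeros) is a choice made for this example only.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant variable carries no information; map it to zeros.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(v - lowest) / span for v in values]


precipitation_mm = [120.0, 310.0, 215.0, 500.0]  # hypothetical monthly totals
scaled = min_max_scale(precipitation_mm)
print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```

After scaling, variables measured in millimetres, degrees Celsius or percent share the same 0 to 1 range, which matters for distance-based methods such as SVM.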

We chose easily available explanatory variables for feeding the models (e.g. air temperature, precipitation and soil texture) to make the methods useful for decision-makers, farmers and other stakeholders who may not have access to, or familiarity with manipulating, large-scale databases, while being well aware that richer, publicly available weather, soil and crop-related data exist at global geospatial scale.

Data-Driven Models

The data-driven models tested herein follow different approaches, but they were all fitted to the natural logarithm of the observed yields as a function of the explanatory variables (Lobell & Burke, 2010), as presented in Eq. 4:

$$log({Y}_{obs})=\mathrm{f}(\mathrm{weather~and~soil~information})$$
(4)

where Yobs is the observed yield (kg ha–1), and weather and soil information are the inputs of the data-driven models, spatially aggregated at county scale. The yields estimated by the data-driven models were back-transformed through the exponential function before model performance was evaluated with the statistical metrics.
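
To make the log transformation and back-transformation of Eq. 4 concrete, the sketch below uses a single-predictor linear function as a stand-in for f. This simplification, the function names and the data are assumptions for illustration; the models in this study use several weather and soil inputs.

```python
import math


def fit_log_linear(x, y_obs):
    """Fit log(y) = intercept + slope * x by ordinary least squares."""
    log_y = [math.log(y) for y in y_obs]  # natural logarithm of the yields
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (ly - mean_ly) for xi, ly in zip(x, log_y))
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_x
    return intercept, slope


def predict_yield(intercept, slope, x_new):
    """Back-transform the log-scale prediction to kg/ha with exp()."""
    return math.exp(intercept + slope * x_new)


# Hypothetical normalized precipitation and observed yields (kg/ha).
precipitation = [0.0, 0.25, 0.5, 1.0]
yields = [1900.0, 2300.0, 2600.0, 3400.0]
intercept, slope = fit_log_linear(precipitation, yields)
print(round(predict_yield(intercept, slope, 0.75)))
```

Because the model is fitted on the log scale, the error metrics are only comparable with observed yields after the exponential back-transformation.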

We investigated the performance of two widely used data-driven models: random forests (RF) and support vector machines (SVM). In addition, the performance of multiple linear regression (MLR) was investigated and used as our baseline method.

MLR is a widely used statistical technique that accounts for the linear combined effects of the input variables to explain variations in the response variable. Owing to its simplicity and ease of use, it has been widely used in agronomic applications for decades (Olson & Olson, 1986).

RF is a broadly used machine learning method based on an ensemble of multiple trees to solve classification and regression problems (Breiman, 2001). This method produces multiple random trees, which “vote” for the most popular class given a set of characteristics. For regression purposes, in particular, the output of the RF algorithm is the average output of all trees built. RF was implemented in R software (R Core Team, 2020), using the package “randomForest” (Liaw & Wiener, 2002). The RF models were trained using 100 trees (ntree = 100), since the error is almost constant beyond this number of trees (Figure S1), with the number of variables randomly sampled at each split equal to 3 (mtry = 3). The parameters “ntree” and “mtry” are arguments of the randomForest function.
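
The idea that a regression forest averages the outputs of many randomized trees can be shown with a deliberately simplified toy ensemble. The sketch below is not the randomForest package: it assumes one-split trees (stumps), a split at the mean of one randomly chosen feature, and bootstrap resampling, and all names and data are hypothetical.

```python
import random


def fit_stump(rows, targets, feature):
    """One-split regression tree on a single feature (split at its mean)."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
    overall = sum(targets) / len(targets)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return feature, threshold, left_mean, right_mean


def fit_ensemble(rows, targets, n_trees=100, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw as many rows as available, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        feature = rng.randrange(len(rows[0]))
        trees.append(fit_stump(sample_rows, sample_targets, feature))
    return trees


def predict(trees, row):
    """Regression output: the average of the individual tree outputs."""
    outputs = [lo if row[f] <= thr else hi for f, thr, lo, hi in trees]
    return sum(outputs) / len(outputs)


# Hypothetical normalized inputs (Tmax, Prec, clay) and yields (kg/ha).
rows = [[0.2, 0.8, 0.5], [0.9, 0.1, 0.4], [0.5, 0.5, 0.6],
        [0.7, 0.3, 0.2], [0.1, 0.9, 0.7], [0.6, 0.4, 0.3]]
targets = [3200.0, 2100.0, 2900.0, 2400.0, 3300.0, 2600.0]
trees = fit_ensemble(rows, targets, n_trees=100, seed=1)
estimate = predict(trees, [0.3, 0.7, 0.5])
print(round(estimate))
```

The number of trees plays the role of “ntree” here, while the single random feature per stump is a crude analogue of “mtry”; a real forest grows deeper trees and searches for the best split among the sampled variables.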

SVM are robust and widely used machine learning algorithms for both classification and regression problems. SVM seeks an optimal hyperplane that maximizes the distance between samples (or classes), separating groups with similar characteristics; the samples that define this hyperplane are the support vectors (Cortes & Vapnik, 1995). The SVM approach was also implemented in R software, through the “e1071” package (Meyer et al., 2021), with a radial basis kernel. SVM models were built with the default parameters, since tuning did not considerably improve the results in previous studies (Lischeid et al., 2022).

Modelling Strategies

Two steps were used to verify the suitability of data-driven models to predict soybean yields in Brazil. First, we built the models with explanatory variables temporally aggregated at 30, 60, 90 and 120 days after sowing (DAS), to verify how early the soybean yields could be predicted, according to the statistical performance of the models. Once the “best” model was determined, we performed a “leave-one-year-out” cross-validation (LOYOCV) to predict soybean yield for each of the selected counties. Figure 1 shows the steps considered during our modelling process.

Fig. 1 Flowchart showing the main steps used to build and evaluate the performance of data-driven models to estimate soybean yields at large scale in Brazil

Early Prediction and End-of-Season Estimation of Soybean Yields

The robustness of the models was tested for early prediction of soybean yields at 30, 60 and 90 days after sowing (DAS). Furthermore, the end-of-cycle soybean yield (i.e. 120 DAS) was estimated and considered our “baseline” for checking how early the models could accurately estimate soybean yields. The performance of the models was measured through the coefficient of determination (R2), to quantify how much of the variance is captured by the fitted model; the root mean squared error (RMSE, in kg ha–1), to determine the absolute error of the model; and the mean-weighted RMSE (rRMSE, in %), to represent the relative error.

The data-driven models were built using a standard strategy to split the whole dataset (3450 samples) into training (2415 samples, 70%) and testing (1035 samples, 30%) subsets. This random selection was performed 100 times, aiming to minimize potential effects of sample selection. After each iteration, the performance of each model investigated (MLR, RF and SVM) was determined through the statistical metrics.
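
The repeated 70/30 random split can be sketched as below. The function name and the random seed are assumptions for illustration; the sample count matches the dataset size stated above.

```python
import random


def repeated_splits(n_samples, train_fraction=0.7, n_repeats=100, seed=42):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)  # seed fixed only to make the sketch repeatable
    n_train = round(n_samples * train_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


splits = list(repeated_splits(3450))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # 100 2415 1035
```

Each repetition would be used to fit a model on the training indices and score it on the testing indices, so that the metrics form a distribution over 100 runs instead of a single value.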

Leave-One-Year-Out Cross-Validation Approach (LOYOCV)

Once the “best” model (i.e. the combination of model and number of days after sowing that showed the best statistical coefficients) was chosen, the LOYOCV approach was performed. In this approach, each year is withheld in turn, the model is trained on the remaining years and then used to predict the withheld year. Thus, we investigated whether a given model would be suitable to estimate soybean yields according to the environmental characteristics of a specific and independent year. The residuals (i.e. the difference between predicted and observed yields) are presented geospatially at county scale for each of the years evaluated in this study, as well as the relationship between predicted and observed yields.
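
A minimal sketch of the leave-one-year-out partitioning is given below. The record layout (dictionaries with "county" and "year" keys) and the function name are hypothetical; the counts follow from 150 counties and 23 years.

```python
def leave_one_year_out(records):
    """Yield (held_out_year, train_records, test_records) for every year."""
    years = sorted({r["year"] for r in records})
    for held_out in years:
        train = [r for r in records if r["year"] != held_out]
        test = [r for r in records if r["year"] == held_out]
        yield held_out, train, test


# Hypothetical index of the dataset: 150 counties x 23 years (1996-2018).
records = [{"county": c, "year": y}
           for c in range(150) for y in range(1996, 2019)]
folds = list(leave_one_year_out(records))
first_year, train, test = folds[0]
print(len(folds), first_year, len(train), len(test))  # 23 1996 3300 150
```

Unlike the random 70/30 split, no record from the predicted year is seen during training, so each fold tests the model on weather conditions it has not encountered.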

Statistical Metrics for Model Evaluation

The performance of the data-driven models was evaluated through standard statistical coefficients broadly used in agro-ecological modelling studies. In addition to the coefficient of determination (R2), the root mean squared error (RMSE, kg ha–1) and the mean-weighted root mean squared error (rRMSE, %) were calculated to determine the robustness of the models regardless of the choice of samples, through Eqs. 5 and 6.

$$\text{RMSE}\,({\rm kg\,ha}^{-1}) = \sqrt{\frac{\sum_{i=1}^{n}{\left({Y}_{est,i} - {Y}_{obs,i}\right)}^{2}}{n}}$$
(5)
$$rRMSE\,(\% ) = 100 \times \frac{{RMSE}}{{\overline{{Y_{{obs}} }} }}$$
(6)

where \({Y}_{est,i}\) and \({Y}_{obs,i}\) are the estimated and observed yields for sample i; n is the number of samples; and \(\overline{{Y }_{obs}}\) is the average of observed yields.
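
Eqs. 5 and 6 translate directly into code. The sketch below uses hypothetical yields chosen so that the arithmetic is easy to follow; the function names are illustrative.

```python
import math


def rmse(estimated, observed):
    """Root mean squared error (Eq. 5), in the units of the inputs."""
    n = len(observed)
    squared = sum((e - o) ** 2 for e, o in zip(estimated, observed))
    return math.sqrt(squared / n)


def rrmse(estimated, observed):
    """RMSE relative to the mean observed value (Eq. 6), in percent."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(estimated, observed) / mean_observed


observed = [3000.0, 3200.0, 2800.0, 3000.0]   # hypothetical yields, kg/ha
estimated = [3300.0, 2900.0, 3100.0, 2700.0]  # every error is +/- 300 kg/ha
print(rmse(estimated, observed))   # 300.0
print(rrmse(estimated, observed))  # 10.0
```

Here every estimate is off by 300 kg ha–1 and the mean observed yield is 3000 kg ha–1, so the RMSE is 300 kg ha–1 and the rRMSE is 10%.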

Results

Selection of High-Quality Soybean Yield Datasets

Following the criteria described in the “Soybean Yield Database” section, a total of 150 counties remained (~ 27%) and composed our so-called “high-quality” soybean yield database. Thus, the data-driven models were fed with a total of 3450 records (150 counties × 23 years), with soybean yield as the response variable of our models. The average soybean yield over the last five years ranged from less than 2000 to more than 3000 kg ha–1, averaging 3063.1 kg ha–1 (Fig. 2c). The geographic distribution of soybean yields during the last five years of the time series evaluated in our study, the yearly variability of soybean yield, and its frequency distribution are shown in Fig. 2.

Fig. 2 Spatial variability of the average soybean yield (2014–2018) at the 150 “high-quality” counties (a); year-to-year variability of soybean yields and their technological progress (dashed line) throughout the period analysed in Brazil (1996–2018), where the linear equation describes the relationship between crop yields and years, and the slope of the trend line (the coefficient of x) represents the general technological progress of soybean (45.7 kg ha–1 year–1) (b); and the frequency analysis of the average soybean yields (2014–2018) (c)

These results provide insights into how diverse and challenging the large-scale modelling of soybean yields can be, given the different cropping systems, genotypes (maturity groups, harvest timing, disease and drought resistance), environmental conditions (air temperature and precipitation patterns) and agronomic practices (sowing and harvest dates, plant density, row spacing, fertilization types and rates) across the country. The average technological progress of soybean is 45.7 kg ha–1 year–1 (Fig. 2b), but a large diversity in levels of technology can be seen in Brazil, ranging from 10 to 105 kg ha–1 year–1. This highlights the different cropping systems under which soybean has been grown across the country during the last decades (Figure S2).

Performance of the Data-Driven Models: Calibration and Validation Steps

The narrow distribution of the statistical metrics during the calibration step strongly suggests that our models are robust, regardless of the choice of samples (repeated 100 times), for both the early (30, 60 or 90 DAS) and end-of-cycle (120 DAS) scenarios. Nevertheless, during the validation step the distribution curves are more scattered. In Fig. 3, the coloured histograms show the distribution of the RMSE (kg ha–1), given the choice of samples for building the prediction and estimation models. Similar curves are shown in Figures S3 and S4, representing the variability of R2 and rRMSE, respectively.

Fig. 3 Variability of the RMSE (kg ha−1) across the 100 random selections of the calibration and validation subsets used to build the data-driven models to predict and estimate soybean yields

The performance of the models improved progressively as the number of days increased up to the whole crop cycle (120 DAS). The MLR models presented the poorest performance, probably due to their linear approach. MLR always yielded RMSE greater than 500 kg ha–1 for the calibration and validation steps, representing a relative deviation slightly below 18% (Figure S3). On the other hand, MLR visually presented the largest share of RMSE curves overlapping each other, highlighting the robustness of our models regardless of the choice of samples used to build them.

In contrast, the RF and SVM machine learning models presented better results than MLR, possibly due to their non-linear approaches and higher capacity to detect patterns and relationships between explanatory and response variables. Only the earliest prediction scenario (30 DAS) generated RMSE greater than 500 kg ha–1 during the calibration step for both RF and SVM models. The other scenarios (60, 90 and 120 DAS), however, usually had RMSEs ranging from 400 to 500 kg ha–1 (12–15%, Figure S3), and only a few combinations yielded RMSE smaller than 400 kg ha–1 (< 12%, Figure S3) (SVM model). In the validation step, similarly to the MLR models, the distribution curves of RF and SVM had slightly larger variability (more scattered) for all the statistical metrics evaluated (Figs. 3, S3 and S4). Furthermore, few differences can be identified between the RF and SVM validation RMSE curves, suggesting similar performance of these methods.

Performance of the Data-Driven Models: Choosing the “Best” Model

The curves representing our scenarios (30, 60, 90 and 120 DAS) are very similar in shape and position relative to the x-axis at the validation step. Nevertheless, the 60, 90 and 120 DAS curves of the RF models apparently overlap more than those of the SVM models. Therefore, we selected RF for making yield predictions using the LOYOCV approach. Despite the potential use of models built at 60 DAS to predict soybean yields, only a small portion of those curves overlap each other, suggesting a higher risk of skewed soybean yield predictions from models built that early, a period that coincides with the most critical crop phases (flowering and grain filling) (Steduto et al., 2012).

Performance of the Data-Driven Models: LOYOCV Approach

At country scale, the RF model was partially able to capture the effects of agro-climatic conditions on soybean yields, since the averages of predicted (at 90 DAS) and observed yields were similar (Table 1).

Table 1 Overview of the yearly variability of soybean yields in Brazil

The RF model tended to underestimate soybean yields, with 15 years showing negative residuals (Table 1). The largest deviation was observed in 2005, when the residual reached 650.6 kg ha–1 (25.6%). On the other hand, negative residuals below −400 kg ha–1 (~ 11%) were not observed, indicating the potential of data-driven models for crop yield analysis at large scales using few input data to feed the models.

Additionally, the performance of the RF model in estimating soybean yields for a particular and independent year is presented at county scale for the 23 years evaluated in our study, showing the geospatial and temporal distribution of the residuals (Fig. 4). In general, there is a large share of white (i.e. residuals within ± 250 kg ha–1) or light-coloured (± 500 kg ha–1) areas throughout the years. In contrast, particular years such as 2005 and 2006 show predominantly darker-coloured (either green or brown) regions, indicating poor performance of the model in those years (residuals higher than 1500 kg ha–1). This underperformance can be associated with factors that were not considered as explanatory variables, such as the occurrence of extreme climate conditions during the crop cycle, which limited the ability of the model to capture yield variation in those years (Figure S5). Also, the relatively short time series used to train the model and make predictions might include only a few years with those particular conditions. For example, the accumulated precipitation in southern Brazil during 2005, 2006 and 2012 was much lower (less than 30%) than the average precipitation during the simulated soybean cycle (1996–2018, Figure S6). Additionally, the maximum air temperature presented positive deviations in a large part of southern Brazil, particularly in 2005, 2006 and 2014 (Figure S7), likely affecting yields (Hatfield & Prueger, 2015). Although soybean is unlikely to be affected by low temperatures in Brazil, we also investigated the geospatial and temporal variability of low temperature during the period assessed in this study (Figure S8).

Fig. 4 Geospatial and temporal distribution of the soybean yield residuals at the “high-quality” counties. Residuals were calculated as the difference between estimated and observed soybean yields

Discussion

Technological Progress of Soybean Yields

In this study, we aimed to investigate the performance of data-driven models for early prediction and end-of-season soybean yield estimation at large scales in Brazil. Given the continental extent of the country, there are naturally several soybean production systems, in which farmers adopt different technologies in their fields and regions, leading to different technological advances across regions. Technological progress is typically associated with the gradual change in technology and management practices adopted by farmers in a given region over time (Figueiredo, 2016).

The yield dataset was de-trended to minimize the effects of technological progress across different regions, assuming a linear gain (in kg ha–1 year–1, Figure S2) for the whole set of high-quality counties evaluated. Nevertheless, regions with a high level of technology probably present non-linear genetic gains over the years, while in other regions where soybean is expanding, farmers are forced to adopt more suitable practices (e.g. sowing date), use new cultivars or even replace old cultivars with others better adapted to the environmental conditions (Umburanas et al., 2022). We identified large variability of technological packages in Brazil, with technological progress averaging 45.7 kg ha–1 year–1 (Fig. 2b). However, since Brazil is a country of continental dimensions, there is a broad range of technological progress across the soybean producing areas, varying from 10 to 105 kg ha–1 year–1 (Figure S2). That variability is likely to be related to advances in plant breeding and the introduction of modern genotypes in commercial fields (Rogers et al., 2015; Umburanas et al., 2022). Also, management practices such as optimized water use in soybean fields (da Silva et al., 2019), adjustment of sowing dates to reduce the risk of crop failure due to water deficit during the flowering and grain filling periods (Nóia Júnior & Sentelhas, 2019), and adoption of new cultivars adapted to new agricultural frontiers such as the Amazon forest might increase crop resilience under climate change scenarios (Hampf et al., 2020). Therefore, given the broad range of factors that might affect yield gains through the level of technology adopted by farmers, the yield dataset was de-trended (Figure S2). We used this approach because our main goal in this study was to use the data-driven models to make short-term yield predictions. In this case, the impact of technology is unlikely to affect yields as significantly as in long-term yield predictions, as demonstrated by Hampf et al. (2020) in Brazil.

Large-Scale Yield Simulation: The Data-Driven Models

Although several studies have applied data-driven models for scaling up crop yields over large areas (e.g. a country), there are still several issues associated with the methodology used, especially regarding how the models’ structure and parameterisation affect the outputs and the optimal strategy to split the dataset for training and validation. In this study, we used perhaps one of the most common approaches to split the dataset into calibration and validation subsets: 70% and 30%. Paudel et al. (2022) also successfully used the 70–30% ratio to split the dataset and build machine learning models to investigate yield patterns and trends for six crops in nine countries in Europe. Although these authors included explanatory variables related to crop phenology (i.e. vegetative and reproductive phases), their range of rRMSE (10–30%) was similar to the one we found (9.2–41.5%). Using up to 28 explanatory variables to predict corn yields in 10 states in the USA Corn Belt region, Jiang et al. (2020) tested several data-driven approaches, with relative errors for the RF model ranging from 7 to 33%. Other approaches to data splitting were investigated in Germany, where the data-driven models (RF and SVM) were fed with weather data and process-based model outputs. Using 90% of the dataset to calibrate the model and 10% to test it, the authors were able to capture up to 70% of the crop yield variability at national scale (Lischeid et al., 2022). This highlights further room for including explanatory variables such as remote sensing products, for instance from the Landsat or Sentinel constellations, to potentially improve model accuracy. Another factor that supports our results in terms of model robustness is the sampling choice. The 100 random selections that we performed are likely to have reduced the probability of skewed results and the influence of any single sample choice. However, this procedure is not clearly described in most papers using machine learning for crop yield assessments.

Large-Scale Yield Simulation: Model Performance

Recently, regional analyses have been made using machine learning methods for crop yield prediction in Brazil (dos Santos et al., 2021; Fernandes et al., 2017; Schwalbert et al., 2020). The models that we tested, despite their simplicity, performed similarly to those in previous studies using well-calibrated process-based models under experimental field conditions. For example, using process-based simulation models, Battisti et al. (2017) tested the performance of different models and obtained RMSE ranging from 262 to 2010 kg ha–1, whereas in our study the same metric ranged between 400 and 500 kg ha–1 when RF and SVM were used (Fig. 3). In central Brazil, Carauta et al. (2017) assessed the performance of the MONICA model under different field conditions, coupled with a micro-agent simulation model (MPMAS), finding an RMSE around 480 kg ha–1 for soybean yield. In contrast, when using data-driven models, Schwalbert et al. (2020) coupled remote sensing indices (i.e. NDVI and EVI) with weather data to feed machine learning models and predict soybean yield in a typical soybean region in southern Brazil, and found RMSE values varying from around 390 to 570 kg ha–1 using RF models. These authors also identified large yield deviations in some years (e.g. 2005), highlighting the need for longer data series (i.e. where a larger number of “atypical” samples are potentially found), so that the data-driven models can learn from such atypical conditions. In the “Cerrado” region, dos Santos et al. (2021) investigated the suitability of several data-driven models to estimate soybean yields, finding that RF showed the best performance. In that study, in addition to weather variables, they included crop phenology and outputs from a soil–water balance, yielding RMSE often lower than 200 kg ha–1.

Uncertainties and Potential Improvements

Our results indicate that data-driven models are powerful tools for predicting and monitoring crop yields, and for assessing environmental impacts, at large scales with publicly available information. However, several aspects are likely to contribute to uncertainty in the model outputs, and herein we address some of them. For example, information is lacking on how the crop yield dataset was collected and harmonized by the IBGE system, as are detailed geospatial datasets on agricultural management practices (fertilizer rates, sowing and harvest timing, impact of insects and diseases), which play a fundamental role in determining crop production. On the other hand, datasets have been made available to characterize water resources and irrigation practices at the global scale, although uncertainties associated with input data, changes in geographic distribution and the lack of temporal and spatial patterns in some regions (for example, developing areas) should be considered (Siebert et al., 2015).

Regarding the choice of weather data, many studies have shown that the source of meteorological data, as well as how it is aggregated, has a significant impact on the modelled outputs (e.g. yield) (Hoffmann et al., 2016; Van Wart et al., 2013; Zhao et al., 2015). Here, we used the weather datasets made available by WorldClim, which is a product from the Climate Research Unit (CRU) at a monthly time step, and our models produced satisfactory results given the coarse scale at which the analyses were carried out. However, further analyses could be performed with the daily time-step weather products available in the AgERA5 database (https://cds.climate.copernicus.eu/cdsapp#!/dataset/sis-agrometeorological-indicators?tab=overview), where “netcdf” files are available from 1979 to near-present, covering the whole globe on a regular 0.1° latitude-longitude grid. Thus, explanatory variables such as the number of dry days and the number of heat days are likely to be included in our future analyses, aiming to improve model accuracy.
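As a rough illustration of how such explanatory variables could be derived from a daily weather series, the sketch below counts dry days and heat days over a growing window. The thresholds (precipitation below 1 mm for a dry day, maximum temperature above 35 °C for a heat day), the function names and the sample values are hypothetical. They are not taken from this study or from AgERA5.

```python
def count_dry_days(daily_precip_mm, dry_threshold_mm=1.0):
    """Number of days with precipitation below the (assumed) dry-day threshold."""
    return sum(1 for p in daily_precip_mm if p < dry_threshold_mm)


def count_heat_days(daily_tmax_c, heat_threshold_c=35.0):
    """Number of days with maximum temperature above the (assumed) heat threshold."""
    return sum(1 for t in daily_tmax_c if t > heat_threshold_c)


# Hypothetical one-week series for a single grid cell.
precip_mm = [0.0, 12.4, 0.2, 0.0, 5.1, 0.0, 0.9]
tmax_c = [31.2, 36.5, 35.0, 37.1, 29.8, 35.4, 33.0]

dry_days = count_dry_days(precip_mm)   # 5 days below 1 mm
heat_days = count_heat_days(tmax_c)    # 3 days above 35 °C
print(dry_days, heat_days)
```

In practice the same counts would be computed per grid cell over the crop cycle (for example, from sowing to 90 DAS) and then aggregated to the spatial unit of the yield data.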

Finally, the idea of coupling remote sensing products with process-based crop simulation models (so-called hybrid models) for large-scale yield monitoring is very attractive. Nowadays, cloud platforms such as Google Earth Engine are fundamental for accurate large-scale assessments that rely on land monitoring, allowing rapid access to and processing of remotely sensed, satellite-derived products. Recent studies have successfully used hybrid approaches to investigate the main factors that drive the yield variability of corn in the USA by merging vegetation indices and outputs from validated crop simulation models (Deines et al., 2021; Kang et al., 2020; Lobell et al., 2015). As previously mentioned, studies on the use of hybrid models for yield prediction in Brazil are rare. This is partly due to the lack of observed input datasets for calibrating and validating process-based simulation models beyond the experimental fields located at research centres or universities. Such models have proved suitable for generating “pseudo” yield observations at fine scale, in addition to other potential explanatory variables for building data-driven models. Hence, hybrid approaches of this kind are likely to be considered in future analyses of short-term crop yield monitoring at large scales, since the factors that control plant growth, development, and water and health status can also be monitored through those products at fine spatial and temporal resolutions. Furthermore, hybrid approaches are of particular interest because they benefit from the capacity of process-based models to systematically generate crop yields for long-term future scenarios, which cannot be achieved with data-driven models alone.

Conclusions

The main findings of this study highlight the potential of data-driven models for crop yield prediction at large scales using the publicly available databases in Brazil, which few studies have explored for similar purposes. Although RF and SVM models showed a certain robustness in predicting soybean yield (R2 from 0.17 to 0.68, nRMSE ranging from 9.2 to 41.5%) already at an early stage (90 DAS), this was a general analysis in which publicly available datasets were considered to explain the spatial and temporal variation of soybean yield in Brazil. Therefore, there is still room for enhancing the accuracy of these models through the integration of more complex sources of data (i.e. remote sensing products). In addition, hybrid approaches, for instance combining outputs from process-based crop models (e.g. growing degree-days, flowering dates) with environmental characteristics associated with extreme climate events, such as the number of dry and heat days during the cycle, might be a valuable option for increasing the accuracy and usefulness of data-driven models. Additionally, the inclusion of remote sensing products such as vegetation indices (e.g. NDVI, EVI), which are available on cloud platforms, might be an alternative for increasing the explanatory power of the models used in this study. Hence, combinations of data-driven and process-based models with real-time sensor data may become an interesting approach for enhanced crop yield monitoring and improved development of decision-making strategies at large scales in the near future.