Introduction

The ability to forecast final crop yields before and during a growing season is a powerful tool for farmers. If yield can be predicted with greater accuracy, there is the opportunity to plan ahead, ultimately reducing business risk (Kantanantha et al. 2010). Forecasted yields are also valuable in guiding management decisions such as the application of fertiliser (Raun et al. 2001). The spatial resolution of these yield predictions is a crucial component, as this allows management to be tailored to different fields within a farm, or at the sub-field level. Traditionally, farmers and their advisors estimate their yield goals based upon previous experience and seasonal conditions, and then used this as a guide to construct management decisions (Dahnke et al. 1988; Raun et al. 2001). Given that crop yield is controlled by the interaction between management, soil and weather, this yield goal should vary from season to season, but also vary from location to location.

More systematic, quantitative approaches to predicting crop yield typically use mechanistic/simulation models, such as Agricultural Production Systems sIMulator (APSIM) (Keating et al. 2003) or Decision Support System for Agrotechnology Transfer (DSSAT) (Jones et al. 2003). While these mechanistic models can be useful in certain circumstances, there are several limitations. In these approaches, many underlying assumptions are made, a large selection of inputs is required, and the model is often not well-suited to the study area of interest. It is also possible to forecast yield using empirical, data-driven approaches. Examples from the past have typically used crop reflectance data from remote or proximal sensing platforms during the growing season to make a prediction of final crop yield (e.g. Raun et al. 2001; Boydell and McBratney 2002). While many of these empirical approaches have traditionally ignored other climatic and geo-physical variables, more recent research has included a larger suite of data sources with this remotely and proximally sensed data (e.g. Balaghi et al. 2008; Walsh et al. 2013). As an example, Walsh et al. (2013) combined data from a GreenSeeker sensor with soil moisture data to improve yield forecasts and hence improve recommendations of nitrogen fertiliser application. Including these additional data sources can be invaluable, and the wealth of agricultural and environmental data available today is an exciting and promising opportunity to further improve yield forecasting through empirical approaches.

In modern broadacre agricultural systems, farmers typically have a store of spatial agricultural data. This often includes a time-series of crop yield monitor data, which is frequently augmented with auxiliary spatio-temporal data, such as soil test results, apparent electrical conductivity (ECa) and gamma radiometrics surveys. A survey of Australian farmers by Bramley and Ouzman (2018) found that 77% of respondents use a yield monitor when harvesting, and 50% of respondents generate yield maps. The use of high-resolution soil surveying is less common, with 26% of respondents adopting this. This data is highly valuable but is often underutilised due to various limitations, such as being in different formats, located in a variety of repositories, and consisting of different spatial and temporal resolutions. There is often disconnect between these different data sources, and they are rarely combined and typically viewed in isolation. Not only is agricultural data collected on-farm becoming increasingly available, but publicly-available spatial and temporal environmental data is also becoming more available at finer spatial and temporal resolutions, and at declining costs. The nature of these publicly-available datasets is diverse, and includes remotely sensed imagery, geophysical and climate data.

Theoretically, every agricultural crop can be considered an experiment, where the yield is a function of the interaction between a suite of variables that vary in space and time. This abundance of data collected on-farm, along with the publicly-available data, describe the conditions under which crops are grown and the response in terms of yield. Machine learning techniques are well-equipped to deal with large datasets with many variables, and they provide the opportunity to develop predictive models of crop yield using this multidimensional data. While traditional agronomy and precision agriculture (PA) has typically been concerned with examining single fields in isolation, there is now the opportunity to explore the value of combining this data over multiple fields and years into one dataset and model. Such an approach would utilise historical and in-season information from neighbouring fields to guide yield forecasts within each field and has the potential to stimulate a paradigm shift in precision agriculture.

Therefore, rather than focusing on single fields in isolation, this study collates spatial and temporal data gathered on-farm and relevant publicly-available data for a series of farms in the southern agricultural region of Western Australia (WA). These datasets are then combined with random forest models to create predictive models of yield for wheat (Triticum aestivum L.), barley (Hordeum vulgare L.) and canola (Brassica napus L.) in a crop rotation program. The predictions are made at three time points in the growing season that are important for agronomic management intervention or marketing.

Methods

Study area and period

The study was conducted on three large aggregations (A, B and C) of several farms owned by a single corporation that are located in the southern agricultural region of Western Australia (Fig. 1). The majority of the vegetation in this region of WA has been cleared for agricultural purposes, although the study areas are close to the Lake Magenta Nature Reserve and Fitzgerald River National Park. The soils of the area are typically sandy with notable amounts of gravel, and one of the most widespread soil types are Sodosols (classified according to the Australian Soil Classification). Dryland winter cropping is the sole enterprise in the study regions, with wheat, barley and canola grown in rotation. The study area has a Mediterranean climate, with cool, wet winters and hot, dry summers and the average annual rainfall of the aggregations ranges from 420 to 533 mm. This study uses data from the 2013, 2014 and 2015 growing seasons, with notable variation in the amount of total rainfall received between each year (Table 1). During this period, total annual rainfall values ranged from 389.0 mm to 687.4 mm for the different aggregations (BOM 2017a, b).

Fig. 1
figure 1

Location of the three aggregations of farms (A, B, C) in the southern region of Western Australia used in this study

Table 1 Total annual rainfall (mm) for the different aggregations of farms for 2013, 2014, and 2015

Yield forecasting approach

There are four primary steps to obtain a yield prediction/forecast using the proposed empirical modelling approach (Fig. 2). The first is the gathering of raw data, which vary in their type, format, and spatial and temporal resolution. The data available for this study included high-density data that varies only in space (e.g. gamma radiometrics), high-density data that varies in space and time (e.g. in-season imagery), and lower resolution data that varies in space (e.g. soil physical test results) or space and time (rainfall). This raw data is then processed, and feature extraction is performed, if required, to transform the data into a more useful format. For example, soil test results used in combination with covariates was used to create a map of soil properties. The next step is to collate this processed data together into a space–time cube (STC). This STC is then combined with machine learning approaches to create a parameterised, predictive model of crop yield which can then be used to forecast yield (Fig. 2). This approach is explained in greater detail in the subsequent section.

Fig. 2
figure 2

The workflow pattern and broad steps of the empirical approach used to forecast crop yield

Datasets (space–time cube) and processing

A variety of spatial and temporal data collected on-farm, and publicly-available environmental and agricultural data for the whole study area was collated into a space–time cube (STC). The data were of varying spatial and temporal resolutions, and consisted of yield monitor data, soil texture information, soil apparent electrical conductivity (ECa) and gamma radiometrics surveys (collected on-farm), remotely sensed information and climate data (publicly-available).

Yield from wheat, barley and canola crops from three different seasons that covered 10 587, 16 001 and 16 755 ha in 2013, 2014, and 2015, respectively, were used (Table 2). The amount of yield monitor data for each crop varied by each season and aggregation, with the most yield data available for wheat, followed by barley, and then canola (Table 2). The yield data were corrected and standardised using field total yields measured at the silo after harvest and then processed following the method of Taylor et al. (2007) to a 10 m resolution.

Table 2 Number of hectares (ha) of yield monitor data for each crop type and season within each aggregation

The predictor variables that were available are listed in Table 3. Each aggregation had ECa and gamma radiometric surveys to 10 m resolution performed by a single consulting company, ensuring some consistency. Soil physical test results from 72 point samples had been previously collected using a random sampling scheme at a variety of sampling depth intervals. Equal-area quadratic smoothing splines (Bishop et al. 1999) were used to standardise the observation depths to a single value of 0.5 m. The SCORPAN approach (McBratney et al. 2003) was then applied to create digital soil maps of both clay and sand content, using Easting and Northing, ECa survey data (shallow), and radiometric Thorium (Th), Potassium (K), Uranium (U), and total count as covariates in a random forest model. As the B and C aggregations were in close proximity, the point soil data collected on these aggregations (32 sampling points) was pooled together to create one model. A separate model was created for aggregation A using soil data from the 40 sampling points extracted on aggregation A. The accuracy of the out-of-bag predictions was tested using Lin’s concordance correlation coefficient (LCCC). These models were then used to predict the textural data on to a 10 m grid of each corresponding aggregation.

Table 3 Data sources used in the space–time cube to create the three predictive yield models; April (pre-season), July (mid-season), and September (late-season)

MODIS 16-day Enhanced Vegetation Index (EVI, MOD13Q1) composites at 250 m resolution were acquired from the NASA Land Processes Distributed Active Archive Centre (LPDAAC) portal for the whole study area (https://lpdaac.usgs.gov/, last accessed 21 November 2017). Unlike the Normalised Difference Vegetation Index (NDVI), EVI does not saturate at high canopy density (Huete et al. 2002), and as such it is better suited as a surrogate of vegetation vigour in high input cropping systems. Composites closest to mid-July and mid-September were selected to reflect vegetation conditions at mid- and late growing season. In general, remotely sensed images closest to the middle of September are the most accurate for final yield predictions of winter wheat in the Southern Hemisphere, as this is when flowering and grain-filling occurs (Lyle et al. 2013). Total daily rainfall (mm) maps for the study area were obtained from the Bureau of Meteorology (BOM), and this was then aggregated from January 1st–March 31st, April 1st–June 30th, and July 1st–August 31st inclusive, to be used as predictor variables (BOM 2017c). In addition, a forecast probability of rainfall exceeding the median rainfall for the ensuing 3 months was available (BOM 2017d). The dates for EVI, aggregation of rainfall, and the seasonal forecasts were chosen to coincide with agronomically and managerially important points in the winter cropping season, e.g. sowing (April), mid-season N-fertiliser top-dressing allocation (July), and anthesis (September).

Predictive yield modelling

Random forests (Breiman 2001) were used in conjunction with the STC to create predictive models of crop yield. Rather than creating individual models for wheat, barley and canola, one model was created and crop type was included as a predictor variable. Three models were created based on pre-sowing (April), mid-season (July), and late-season (September) conditions to explore the changes in the predictive ability of the model as more within-season information became available. The models were built at a 100 m resolution, and predictions made at the same 100 m resolution. This was then aggregated up to the field resolution, and the prediction quality was then assessed at the field spatial resolution.

The quality of the model predictions were assessed using cross-validation techniques. The first (1) cross-validation method involved creating a model with all seasons of yield data for all fields in the study area, but without all seasons of yield data for a particular field, and then using that model to predict the yield for that field for the missing years. This is identified as leave-one-field-out cross-validation (LOFOCV). The second (2) cross-validation method involved creating a model with all seasons of yield data for all fields in the study area, but removing only 1 year of yield data for a particular field (prior/other yield data retained in the model), and this model was then used to predict the yield for that field in the missing year. This is identified as leave-one-field-year-out cross-validation (LOFYOCV). In both instances, this was repeated for all fields and years, and the average statistics were calculated. It should also be noted that random forest models were re-fitted for each data splitting iteration. The aim of performing these different cross-validation techniques was to determine the prediction quality for predicting in a new field with no prior data, as opposed to the predicting at a field with prior yield information included in the model. The root mean square error (RMSE) and Lin’s concordance correlation coefficient (LCCC) was used as an assessment of model quality. The LCCC is the fit of the observed and predicted values to the 1:1 line, and is unit-less, making it useful for comparing between models where the magnitude of the predictions may vary (Lin 1989).

Results

Yield modelling predictions

Predictions at the field resolution had a LCCC ranging from 0.19 to 0.27 for the LOFOCV technique, and ranging from 0.89 to 0.92 for the LOFYOCV technique (Table 4). As the season progressed, the models performed slightly better, with the September models possessing the lowest RMSE, and the highest LCCC (Table 4). The significantly improved predictions of the LOFYOCV technique show the important benefit of including prior yield information for a particular field. As an example, cross-validated predictions for the July model improved from an LCCC of 0.20 when no prior yield information was included for a particular field, to 0.91 when prior yield information for the field being predicted was included (Table 4; Fig. 3).

Table 4 Quality of crop yield predictions at the field resolution with the LOFOCV and LOFYOCV approaches
Fig. 3
figure 3

Observed and predicted yield for the July (mid-season) model for all fields and years using the LOFYOCV approach at the field resolution (asterisk: the scale was altered to range from 0 to 100 for privacy reasons)

The value of including prior yield information for a paddock is also supported by Fig. 4, which shows that as more seasons of prior data were available for an individual field, the predictions substantially improved. This figure was created by using cross-validation, where all data available was used to create a model and this was then used to predict yield for a particular field in a particular year (with that field and year removed from the model). For fields that contained only 1 year of data (a) (no years of prior yield data in the model), the LCCC was 0.46, and this improved to 0.89 for fields with 2 years of yield data (b) (1 year of prior yield data in the model), and 0.94 for fields with 3 years of yield data (c) (2 years of prior yield data in the model) (Fig. 4).

Fig. 4
figure 4

Cross validated observed and predicted yield for fields that had a 0, b 1, and c 2 years of prior yield data in the model (asterisk: the scale was altered to range from 0 to 100 for privacy reasons)

Predictor variables and importance

Both clay and sand content were predicted with high accuracy for both the A and B/C aggregation models, with an LCCC ranging from 0.86 to 0.89 based on the out-of-bag predictions (Table 5).

Table 5 The Lin’s concordance correlation coefficient (LCCC) of out-of-bag clay and sand content predictions for the A and B/C aggregations

To test the importance of different predictor variables in the model, the mean decrease in accuracy, based on the mean square error (MSE) was calculated (Fig. 5). The mean decrease in accuracy shows the amount the random forest model prediction accuracy would decrease if that particular variable is excluded. The larger the mean decrease in accuracy for a predictor variable, the more important that variable is deemed. Using the July model as an example, crop type was the most important predictor variable (Fig. 5), and this is expected due to the inherent differences in typical yield and yield potential between wheat, barley, and canola. Within-season variables proved to be vital in the models, with received rainfall, forecasted rainfall, and within-season EVI images amongst the most important covariates. The soil maps and geo-physical data (EM and gamma radiometrics) were less important predictors (Fig. 5).

Fig. 5
figure 5

Predictor variable importance from the July (mid-season) yield model

Discussion

Model testing and cross-validation approach

Overall, this empirical approach to predict wheat, barley and canola yield showed promising results. A down-fall of both cross-validation approaches is that information from other fields for the same year is included in the model. Removing all yield data from the same season was restrictive—a leave-one-year-out cross-validation approach (LOYOCV), as only three seasons of yield monitor data were available. Furthermore, the expanse of crops grown in each season varied considerably within the different aggregations (Table 2). For example, in the 2015 season, canola yield data were only available for the B aggregation, and canola was never grown in the A aggregation in any season. This limitation also prevented leave-one-year-out cross-validation in a study by Donohue et al. (2018), which predicted grain yield using a simplistic approach consisting of a carbon turnover model and a light use efficiency carbon assimilation model. Despite this constraint in the present study, the contrasting results from the different cross-validation techniques suggest that the accurate predictions from the models are not due to the inclusion of yield data from other fields for the prediction year. In the LOFYOCV approach the predictions were accurate (LCCC of 0.89 to 0.92), however, in the LOFOCV approach the predictions were very poor (LCCC of 0.19 to 0.27) despite the model including data from other fields for the year of prediction (Table 4). While this is not completely robust, it is an indication that prior yield data for the prediction field is the driver for the high-quality predictions achieved by the second cross-validation approach, rather than data from other fields from the same year being included in the model.

Dataset size and resolution (spatial and temporal)

It was clear that including prior yield data for a field resulted in more accurate yield predictions, and this is logical, as the model is ‘trained’ to an ‘expected’ yield for that field. This may be due to the consistency of winter crop yield patterns between years, for example high yielding areas are likely in the same location in each of the 3 years. This scenario is more likely in regions like southern WA with reliable winter-dominant rainfall. Consequently, a model then needs to scale the yield values from year to year based on seasonal conditions (weather) and in-season observations (EVI). These findings suggest that a larger time-series of yield monitor data, such as 10 or 15 years would greatly improve the prediction accuracy of yield. This would be possible to test in future studies, as yield monitoring at harvest in the grains industry has been widespread in Australian agriculture for about two decades. While there was some climatic variability between the 2013, 2014 and 2015 growing seasons that were used to develop the yield model, a greater time series would allow the diversity of growing conditions to be better represented in the model. This would also permit whole-years to be removed from the training model during cross-validation, which would give a more accurate depiction of actual model performance for yield forecasting.

While it is obvious that a greater time-series of data would improve model predictions, the impact of the spatial extent of the dataset on prediction quality is less clear and needs to be further explored. In this study crop yield was predicted for a collection of large farms that covers a large area, but future work should focus on whether this approach can be used on smaller spatial domains—e.g. for a single farm. The possibility of data sharing among neighbouring growers is an option, but this presents some challenges. It may be ideal to have one model for a region, or it may be better for individual farms to have a specific model, and the concept of an ideal area that a model covers should be evaluated. For example, it would be valuable to explore whether 10 years of yield data for a 2000 ha farm results in better predictions than 3 years of yield data for a 15 000 ha area. As the spatial domain of the model increases, caution should be taken regarding the variables included. For example, a model created using data from a few different regions would likely include more crop varieties with different inherent yield potential, and it would be expected that variety would be a necessary predictor variable in this situation.

The models in this study were built at 100 m resolution, predicted at 100 m, and then aggregated to the field resolution, but there are obviously opportunities to refine this. The finest spatial resolution of the variables used was 10 m (yield, EM and radiometrics), and it is possible to build the STC at this resolution, however, 100 m was chosen as this was an exploratory study to test a methodology and therefore make cross-validation calculations manageable on a desktop computer. While yield predictions at the field resolution are valuable, predictions at finer resolutions within-fields, such as 10 m or management units/zones would be much more useful to guide management decisions and implement spatial precision agriculture (Bishop and Lark 2007; Taylor et al. 2007; Bishop et al. 2015), such as the variable rate application of fertiliser. The changes in yield forecast quality at different spatial supports will be explored in future work.

Predictor variables and feature extraction

As the season progressed the model predictions slightly improved and this is likely due to more within-season predictor variables being used in the models. The variable importance plot showed that within-season variables were very important predictors in the models, and integrating more of these within-season data sources should be considered in further research. The remotely-sensed EVI images used in this study were sourced from MODIS and are at a 250 m spatial resolution, however there are now opportunities to include freely-available data at finer spatial scales, such as Landsat at 30 m or Sentinel 2 at 15 m resolution, and this would give more detailed information (Lewis et al. 2017). Soil moisture information would also likely be a valuable covariate in the predictive yield models, and soil moisture products are now becoming more readily available at finer spatial and temporal resolutions, particularly through the use of the Sentinel satellites (Torres et al. 2012). Both received rainfall and forecasted rainfall were highly important variables in the models, and the inclusion of other climatic data variables, such as temperature could also improve the model predictions. The growing degree-days (cumulative of the average temperature in a day) are a useful way of measuring the physiological development of a crop, and this could also provide useful insight into the expected final crop yield (McMaster and Wilhelm 1997).

The management choices and practices growers implement have a strong impact on final crop yields, and this type of information should be included in the modelling approach in the future. This could include variables such as the crop variety, sowing seeding rate, or amount of applied fertiliser. This would also allow the analysis of how yield would vary according to different management inputs. For example, if a sufficient range of applied fertiliser rates are included in the model this would allow different rates to be selected to test how crop yield would vary with fertiliser amount.

Furthermore, additional research should consider the quality of the models under data-poor scenarios, such as when only freely-available datasets are available, and data-rich scenarios, such as when there is an abundance of data collected on-farm available. This could identify the value proposition for growers when deciding on the type of data to collect, as well as the optimal spatial and temporal resolution. In this study, the variable importance plot (Fig. 4) showed that the soil maps and geo-physical data (soil ECa and gamma radiometrics) were not highly important predictors, but many of these variables were highly correlated with each other. This would mask their combined significance and it is likely that the importance of these types of predictors would increase if some were removed.

Comparison with other approaches to forecast yield

Currently, most approaches to predicting crop yield are through the use of mechanistic/simulation models, such as Agricultural Production Systems sIMulator (APSIM) (Keating et al. 2003). The disadvantages of mechanistic models are that they generally require numerous inputs and there are many assumptions made. The advantage of the empirical approach used is that real, on-farm, and within-season data is used to drive predictions, allowing fewer assumptions to be made. A significant advantage of the modelling approach is that variables of different spatial and temporal resolutions can be combined. However, a downfall is that when a particular covariate is included in the model, the information has to cover the whole study area. This could be a likely problem for data sources collected on-farm, for example, not all growers would have had their property surveyed for ECa. In cases such as this, the trade-off between increased data richness and decreased size of the dataset used to build the model should be evaluated.

Potential use for yield forecasts

Predictive models of the upcoming season’s crop yield are useful, particularly when the predictions are at fine spatial resolutions. There is a number of potential uses of the models presented in this study. However, further development of this approach with denser data, a longer time-series of yield, and more realistic cross-validation is crucial before this could be implemented in-field. Some future possibilities include deciding on futures contracts and market speculation, and identifying yield gaps. There has been considerable interest in variable rate application of fertiliser to increase efficiency and profitability (Stefanini et al. 2018), and the proposed yield forecasting approach has the potential to inform these decisions. The incorporation of management inputs with these models is a promising avenue for future research, and may assist with variable rate application of fertiliser, gypsum, lime, seeding rates, crop type, or variety selection.

Conclusion and future directions

In this study, a preliminary, data-driven approach to predicting wheat, barley, and canola crop yield using only on-farm data and freely available external data were presented. The results from this approach are promising, and the generic nature makes it possible to apply it to many other agricultural systems where yield data is available. Particular benefit was found from including prior yield information for fields, and future work should explore the change in model quality predictions with a larger time-series of yield data (e.g. 10 years). This will also allow leave-one-year-out cross-validation to be performed. Furthermore, the optimal spatial domain coverage for this modelling approach should be investigated to test whether data pooling among farmers within a similar region (as done here) is preferable to constructing a single model for each individual farm. Future work should also explore integration of more freely available data sources and improved feature extraction/layer calibration. Of particular interest would be an increased incorporation of within-season measurements, such as growing degree-days, remotely-sensed information at a finer spatial resolution, and soil moisture products. Forecasting the yield at finer spatial resolutions within-fields, such as a 30 m grid or management zones, would also be desirable.