Introduction

Rapid population growth and unpredictable climate change present significant challenges to the agricultural sector, particularly in terms of ensuring food security, productivity, and sustainability. Climate change, which is leading to extreme weather events, has emerged as a critical concern for a country's food security. With temperatures projected to rise by 1-2.5°C by 2030, crop yields could be substantially affected by changes in photosynthesis, increased plant respiration, and alterations in disease incidence and pest populations (Bhanumathi et al. 2019).

Crop diseases are heavily influenced by weather conditions. The disease triangle, a conceptual model, outlines the fundamental factors responsible for causing disease. It explains that diseases occur when a virulent pathogen interacts with a susceptible host organism under favorable environmental conditions. The disease triangle was first depicted by Stevens in 1960 and was later revisited by Francl in 2001, who expanded the concept to include additional parameters such as humans, vectors, and time. Numerous researchers have dedicated considerable efforts to studying how various weather parameters interact and indirectly impact the development of plant disease outbreaks. These studies underscore the importance of incorporating multiple climate change parameters into models addressing this issue. Newbery et al. (2016) introduced a graphical scheme to facilitate a more concise understanding of how climate, crop growth, and disease models can be integrated to project crop growth stages and disease incidence/severity under different climate change scenarios.

The issue of effective plant disease protection is closely linked to the challenges of sustainable agriculture and climate change. Climate change has the potential to impact pathogen development stages and rates, as well as alter host plant resistance, resulting in physiological changes in the interactions between hosts and pathogens. These changes can have significant implications for the occurrence and management of plant diseases (Garret et al. 2006). Several minor diseases have reappeared during different crop seasons. For instance, Karnal bunt, a significant wheat disease in Punjab, exhibited an upward trend in both severity and prevalence between 2012-13 and 2014-15 (Kaur et al. 2018). Despite being considered a minor disease, Karnal bunt resurfaced during the 2014-15 crop season due to the presence of favorable weather conditions (Sharma et al. 2012). Smiley (1996) emphasized the significance of specific climatic conditions, including appropriate rainfall and associated humidity levels, which play a crucial role in teliospore germination, secondary sporidial multiplication, penetration, and infection. These events need to be synchronized with the susceptible period, typically spanning 3 to 4 weeks leading up to wheat anthesis. The defined suitable rain and humidity events involve measurable rainfall (> 3 mm) occurring on two or more successive days, with at least 10 mm collected within a 2-day interval, and an average daily relative humidity exceeding 70% on both rainy days. To summarize, the climatic conditions during the susceptible period before wheat anthesis, which facilitate the survival, establishment, and spread of Tilletia indica sporidia, include optimum maximum temperatures ranging from 16 to 23°C, optimum minimum temperatures ranging from 7 to 11°C, high average daily humidity (> 70%) or relative humidity exceeding 48% at 3 pm, and measurable rainfall on multiple successive days (Smiley 1996). 
In the context of Punjab, the favorable conditions for Karnal bunt were determined to be a maximum temperature of 25 to 31°C in March, a minimum temperature of 8.5 to 11.0°C in February, morning and evening relative humidity of 85 to 95% and 40 to 60%, respectively, in March, and 5.5 to 9.0 sunshine hours per day, along with rainfall exceeding 25 mm from mid-February to mid-March (Sandhu et al. 2022).
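The rain and humidity event attributed to Smiley (1996) above can be read as a simple rule applied to daily weather records. The following is a minimal sketch under one reading of that rule: two successive days each with more than 3 mm of rain, at least 10 mm in total over the two days, and mean daily relative humidity above 70% on both days. The function name, argument names, and sample values are hypothetical and are not taken from the cited work.

```python
from typing import List, Sequence


def suitable_event_days(rain_mm: Sequence[float],
                        rh_mean: Sequence[float],
                        min_daily_rain: float = 3.0,
                        min_two_day_rain: float = 10.0,
                        min_rh: float = 70.0) -> List[int]:
    """Return indices i where days i and i+1 form a suitable rain/humidity event."""
    if len(rain_mm) != len(rh_mean):
        raise ValueError("rain_mm and rh_mean must have the same length")
    hits = []
    for i in range(len(rain_mm) - 1):
        # Measurable rain (> 3 mm) on both successive days.
        both_rainy = rain_mm[i] > min_daily_rain and rain_mm[i + 1] > min_daily_rain
        # At least 10 mm collected within the 2-day interval.
        enough_rain = rain_mm[i] + rain_mm[i + 1] >= min_two_day_rain
        # Mean daily relative humidity above 70% on both rainy days.
        humid = rh_mean[i] > min_rh and rh_mean[i + 1] > min_rh
        if both_rainy and enough_rain and humid:
            hits.append(i)
    return hits


# Hypothetical one-week record (illustrative values only).
rain = [0.0, 4.0, 7.5, 0.0, 5.0, 4.0, 0.0]
rh = [65.0, 78.0, 82.0, 60.0, 75.0, 72.0, 55.0]
events = suitable_event_days(rain, rh)
print(events)
```

In this illustrative week only the pair starting on the second day qualifies; the later pair of rainy days is humid enough but totals less than 10 mm.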

The liberalization of trade has facilitated the global spread of diseases, leading to the emergence of new diseases in previously unaffected regions where there may be a lack of local expertise to deal with them. Inadequate pesticide usage can also lead to the development of long-term resistance in pathogens, making it difficult for host plants to defend themselves. As a result, timely detection of diseases in plants poses a significant challenge for farmers (Anonymous 2019).

One potential solution to address this challenge is the development of prediction models based on the relationship between prevailing weather conditions and disease severity. By studying the complex interplay between plants, pathogens, and the environment, such models can aid in guiding management decisions. However, the complexity of many plant diseases and their dependence on mathematical or statistical forecasting models can be limiting. Although numerous laboratory studies have been conducted to understand the impact of environmental conditions on the survival and growth of T. indica (Smilanick et al. 1989), and several models have been created to simulate meteorological factors relevant to the establishment and spread of the disease (Jhorar et al. 1992, Mavi et al. 1992, Kaur et al. 2000, Sandhu et al. 2022), the task remains challenging due to the intricate nature of the disease processes involved. Several attempts have also been made to model Karnal bunt (KB) forecasting under Indian conditions (Srinivasan 1980, Jhorar et al. 1992, Mavi et al. 1992, Singh et al. 1996), but all the models could not be validated in Punjab, India (Kaur et al. 2006). Biswas et al. (2013) carried out bivariate probability density analysis to develop a predictive regression model for inoculum load that was successfully validated and could be used for prediction of sporidial activity in the field.

Much of the recent progress has come from advances in computing and storage capabilities, which are expected to improve complex computing systems that can learn to mimic humans and perform specific tasks. Artificial Intelligence (AI) highlights the potential usefulness of detecting patterns and trends in large amounts of data using pertinent mathematical algorithms with the objective of solving a particular task (Winston 1992). The task can be generic, such as computer vision, natural language processing (NLP), or predictive modelling, or specific to an area that would otherwise require an expert in the field. AI includes the fields of Machine Learning (ML) and Deep Learning (DL). While AI is the general term for any task that allows a machine or system to mimic human behaviour and intelligence, machine learning and deep learning are the specific methods used to do so. Machine learning uses algorithms that learn from data, make generalizations, and create rules that enable prediction of one or more target variables on the basis of a set of input variables (Goodfellow et al. 2016). Machine learning not only helps in understanding and developing new models but can also capture highly complex relationships that are difficult to define with mathematical models (Fu et al. 2018). Keeping this in view, this article applies ML algorithms, namely artificial neural networks (ANN), elastic net (ENET), k-nearest neighbour (kNN), least absolute shrinkage and selection operator (LASSO), Bayesian least absolute shrinkage and selection operator (BLASSO), support vector regression (SVR), ridge regression (RIDGE), Bayesian ridge (BRIDGE), multiple linear regression (MLR), principal component regression (PCR), and random forest (RF), to develop prediction models for Karnal bunt disease.

Data methodology

Study area

The study was conducted at two locations, viz. Ludhiana (latitude 30°54′N, longitude 75°48′E, altitude 247 m above mean sea level) and Bathinda (latitude 30°58′N, longitude 74°18′E, altitude 211 m above mean sea level). Ludhiana is located in the central plain region of Punjab, with general climatic conditions classified as sub-tropical and semi-arid, while the Bathinda region falls in the western zone and its climate is classified as semi-arid. The annual normal rainfall of Ludhiana and Bathinda is 760 mm and 436 mm, respectively. In Ludhiana, the summer temperature exceeds 40°C with a dry summer spell, while the lowest temperature may be near 0°C during the winter season. In Bathinda, dust storms are common in May-June when the temperature touches 47°C, and frosty nights associated with chilly winds are common when the night temperature touches 0°C during December-January.

Disease data collection

The data on Karnal bunt incidence for 12 crop seasons (from 2009-10 to 2020-21) in Ludhiana district and 9 crop seasons (from 2010-11 to 2018-19) in Bathinda district were obtained from the Wheat Section of the Department of Plant Breeding and Genetics at PAU, Ludhiana (Fig. 1). To gather the KB incidence data, wheat grain samples were collected from various grain markets in both districts. Approximately 15-30 samples of grains, each weighing between 500 g and 1000 g, were randomly collected from different wheat heaps and placed in paper bags.

Fig. 1 Karnal bunt incidence data of Ludhiana and Bathinda

$$\text{Disease incidence} \left(\%\right)=\frac{\text{No. of infected grains}}{\text{Total no. of grains examined}}\times 100$$
(1)
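As a small worked illustration of Eq. (1), the percentage can be computed per grain sample as follows. This is only a sketch; the function name and the sample counts are hypothetical and do not come from the study data.

```python
def disease_incidence(infected_grains: int, total_grains: int) -> float:
    """Disease incidence (%) as in Eq. (1): infected / total examined x 100."""
    if total_grains <= 0:
        raise ValueError("total_grains must be positive")
    if not 0 <= infected_grains <= total_grains:
        raise ValueError("infected_grains must lie between 0 and total_grains")
    return infected_grains / total_grains * 100.0


# Hypothetical sample: 12 infected grains out of 1000 examined.
sample_incidence = disease_incidence(12, 1000)
print(f"{sample_incidence:.2f}%")
```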

Meteorological data collection

The meteorological data for the respective districts (Figs. 2, 3 and 4) were collected from the Department of Climate Change and Agricultural Meteorology at PAU, Ludhiana. The weather data included maximum and minimum temperatures (Tmax and Tmin), mean relative humidity (RHme), rainfall (RF), and the number of rainy days (RD) for the months of February and March. These months were chosen because they correspond to the anthesis and ear formation stages of wheat, which are considered the most vulnerable stages for the development of Karnal bunt. To begin the analysis, descriptive statistics were applied to the meteorological data. Subsequently, a humid-thermal index (HTI) was developed to forecast the suitability for disease establishment and spread. The HTI was calculated using the following formula:

Fig. 2 Maximum and minimum temperatures of February and March months of Ludhiana and Bathinda

Fig. 3 Mean relative humidity of February and March months of Ludhiana and Bathinda

Fig. 4 Rainfall and rainy days of February and March months of Ludhiana and Bathinda

$$\text{Humid-thermal index}=\frac{\text{Evening relative humidity}}{\text{Maximum temperature}}$$
(2)
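Eq. (2) is a simple ratio, and a short sketch may help show how it would be applied to daily records and averaged over a period. The function names and the sample values below are hypothetical; the averaging step is an assumption about how period-level values might be obtained, not a description of the procedure used in this study.

```python
from statistics import mean
from typing import Sequence


def humid_thermal_index(evening_rh_percent: float, tmax_celsius: float) -> float:
    """HTI as in Eq. (2): evening relative humidity (%) / maximum temperature (deg C)."""
    if tmax_celsius <= 0:
        raise ValueError("maximum temperature must be positive for the ratio to be meaningful")
    return evening_rh_percent / tmax_celsius


def mean_hti(evening_rh: Sequence[float], tmax: Sequence[float]) -> float:
    """Average of daily HTI values over a period (assumed aggregation)."""
    if len(evening_rh) != len(tmax) or not tmax:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(humid_thermal_index(rh, t) for rh, t in zip(evening_rh, tmax))


# Hypothetical single day: evening RH of 55% and Tmax of 25 deg C.
hti_day = humid_thermal_index(55.0, 25.0)
# Hypothetical three-day period.
hti_period = mean_hti([60.0, 50.0, 40.0], [20.0, 25.0, 20.0])
print(round(hti_day, 2), round(hti_period, 2))
```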

The HTI results (Fig. 5) are interpreted as per Table 1.

Table 1 Forecasting suitability for disease establishment and spread (source: Jhorar et al. 1992)

Fig. 5 Humid thermal index of February and March months of Ludhiana and Bathinda

Potential predictor attributes

Eleven attributes were chosen as possible predictor variables (as shown in Table 2), and many of these attributes have been recognized as significant factors in previous studies regarding the development of Karnal bunt disease.

Table 2 Potential predictor attributes

Machine learning regression models

The collected dataset was split into training and testing sets: 70 per cent of the dataset was used for training and the remaining 30 per cent for testing. Machine learning regression models were then trained on the training set to predict the disease. The modelling process is shown in Fig. 6. The models were artificial neural networks (ANN), elastic net (ENET), k-nearest neighbour (kNN), least absolute shrinkage and selection operator (LASSO), Bayesian least absolute shrinkage and selection operator (BLASSO), support vector regression (SVR), ridge regression (RIDGE), Bayesian ridge (BRIDGE), multiple linear regression (MLR), principal component regression (PCR), and random forest (RF).
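The 70/30 partition described above can be sketched as follows. This is a minimal illustration that assumes a random shuffle with a fixed seed and a pooled set of 21 season records (12 for Ludhiana and 9 for Bathinda); the source does not state whether the split was random or whether the locations were pooled, so the function name, the seed, and the pooling are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(records: Sequence[T],
                     train_fraction: float = 0.7,
                     seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle the records reproducibly and split them into training and testing sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split can be repeated
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical record identifiers: 12 Ludhiana + 9 Bathinda crop seasons.
season_ids = list(range(21))
train_set, test_set = split_train_test(season_ids)
print(len(train_set), len(test_set))
```

With so few records, the composition of the test set can influence the accuracy indices noticeably, which is one reason to fix the seed and report it.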

Fig. 6 Process of modelling

Model accuracy terms/indices

Six of the most common accuracy metrics for regression models were used: root mean square error (RMSE), root relative square error (RRSE), correlation coefficient (r), relative mean absolute error (MAE), modified index of agreement (d) and modified Nash–Sutcliffe efficiency (E). Table 3 shows how these metrics are estimated.

Table 3 Model accuracy terms/ indices
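Since the contents of Table 3 are not reproduced here, the sketch below uses the commonly cited textbook definitions of these indices; it is an assumption that the study used the same forms. In particular, the modified index of agreement and the modified Nash–Sutcliffe efficiency are taken in their absolute-difference (exponent 1) form, and RRSE is returned as a fraction that would be multiplied by 100 to express it in per cent. All names and sample values are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def regression_metrics(observed: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Compute six accuracy indices for paired observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    n = len(observed)
    pairs = list(zip(observed, predicted))
    o_bar, p_bar = mean(observed), mean(predicted)
    sse = sum((p - o) ** 2 for o, p in pairs)       # sum of squared errors
    sae = sum(abs(p - o) for o, p in pairs)         # sum of absolute errors
    ss_obs = sum((o - o_bar) ** 2 for o in observed)
    ss_pred = sum((p - p_bar) ** 2 for p in predicted)
    if ss_obs == 0 or ss_pred == 0:
        raise ValueError("observed and predicted values must not be constant")
    cov = sum((o - o_bar) * (p - p_bar) for o, p in pairs)
    abs_dev_obs = sum(abs(o - o_bar) for o in observed)
    potential = sum(abs(p - o_bar) + abs(o - o_bar) for o, p in pairs)
    return {
        "rmse": math.sqrt(sse / n),
        "rrse": math.sqrt(sse / ss_obs),            # fraction; x100 gives per cent
        "mae": sae / n,
        "r": cov / math.sqrt(ss_obs * ss_pred),
        "d_mod": 1.0 - sae / potential,             # modified index of agreement
        "e_mod": 1.0 - sae / abs_dev_obs,           # modified Nash-Sutcliffe efficiency
    }


# Hypothetical example: predictions with a constant bias of +1.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The biased example gives a correlation of 1 but a modified efficiency of 0, which illustrates why a model can show a high r or R2 and still score poorly on d and E.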

Results and discussion

Descriptive statistics

The descriptive statistics of the studied weather parameters are presented in Tables 4, 5, and 6, which give the ranges of the meteorological parameters along with their means and standard deviations for the periods under study. The mean maximum temperature was higher in March (27.34°C) than in February (21.68°C) and the 15 February-15 March period (23.98°C). The mean number of days on which the optimum maximum temperature prevailed was higher in March (15.00) than in the 15 February-15 March period (10.58) and February (3.17). The mean minimum temperature was also higher in March (13.27°C) than in February (9.07°C) and the 15 February-15 March period (10.98°C). The mean number of days on which the optimum minimum temperature prevailed was higher in March (17.92) than in the 15 February-15 March period (16.25) and February (10.00). Mean relative humidity was lowest in March (67.31%), followed by the 15 February-15 March period (71.76%) and February (73.80%). The mean number of days on which the optimum mean relative humidity prevailed was higher in February (19.42) than in the 15 February-15 March period (16.50) and March (12.17). The mean number of rainy days was lower in March (2.58) than in the 15 February-15 March period (3.17) and February (2.75). The mean number of weeks with continuous rainy days was higher for the 15 February-15 March period (0.67) and equal for February and March (0.58). The mean number of days with at least 10 mm of rainfall was higher for the 15 February-15 March period (2.42) than for February (2.33) and March (2.00). The mean HTI was lower for March (1.76) than for February (2.61) and the 15 February-15 March period (2.27).

Table 4 Descriptive statistics of studied parameters for Karnal bunt of wheat for February month
Table 5 Descriptive statistics of studied parameters for Karnal bunt of wheat for March month
Table 6 Descriptive statistics of studied parameters for Karnal bunt of wheat for 15 February-15 March period
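The kind of summary reported in Tables 4, 5, and 6 (range, mean, standard deviation, and the number of days within an optimum band) can be sketched for a single season and parameter as below. The band of 16 to 23°C is borrowed from the optimum maximum temperature range quoted in the Introduction; whether the tables used this band, and whether they report the sample or population standard deviation, is not stated, so both are assumptions, as are the function name and the daily values.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def describe_parameter(daily_values: Sequence[float],
                       optimum_low: float,
                       optimum_high: float) -> Dict[str, float]:
    """Summarise one weather parameter for one season and period."""
    if len(daily_values) < 2:
        raise ValueError("at least two daily values are needed for a standard deviation")
    return {
        "min": min(daily_values),
        "max": max(daily_values),
        "mean": mean(daily_values),
        "sd": stdev(daily_values),  # sample standard deviation (assumed)
        "days_in_optimum": sum(1 for v in daily_values
                               if optimum_low <= v <= optimum_high),
    }


# Hypothetical daily maximum temperatures (deg C) for part of one season.
tmax_daily = [18.5, 21.0, 24.2, 22.8, 26.1, 19.4]
summary = describe_parameter(tmax_daily, optimum_low=16.0, optimum_high=23.0)
print(summary)
```

Averaging `days_in_optimum` across seasons would give a figure comparable in kind to the mean number of days reported in the text.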

Development of disease prediction model

The results are presented both visually (Fig. 7) and numerically (Tables 7, 8 and 9). They demonstrate the adequacy of various machine learning methods for predicting Karnal bunt over the different time periods studied. The most intriguing finding is that different models performed best in different periods. The results obtained surpass the earlier work in this area. The Taylor diagram (Fig. 7) provides readers with a comprehensive understanding of the degree of similarity between patterns in terms of correlation, root-mean-square difference, and variance ratio. While this diagram has a general application, it proves to be particularly valuable in assessing complex models.

Fig. 7 Taylor diagram

Table 7 Root mean square error (RMSE), Coefficient of determination (R2) and Correlation coefficient (r) values of different machine learning models
Table 8 Root relative square error (RRSE) and Mean absolute error (MAE) values of different machine learning models
Table 9 Modified Index of agreement (d) and Modified Nash–Sutcliffe efficiency (E) values of different machine learning models

Fig. 7 shows the reference field, which usually represents the observed state, and the test fields, which usually represent model-simulated values. The purpose is to quantify how closely each test field resembles the reference field. The radial distance from the origin to each point represents the pattern standard deviation, the azimuthal position gives the correlation coefficient between the two fields, and the dashed lines represent RMSE values. For each period, the cross-location multiple regression models (ANN, ENET, kNN, LASSO, BLASSO, SVR, RIDGE, BRIDGE, MLR, PCR and RF) were evaluated against each other. For February, ENET, kNN, BLASSO, SVR, BRIDGE, PCR and RF performed relatively well because they lie relatively close to the reference point. For March, unlike the others, ANN, kNN and MLR grossly underestimated the results. BLASSO, SVR and PCR were identified as good models for the 15 February-15 March period. For the overall period, all the models except ENET and kNN grossly underestimated the results.
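The geometry of a Taylor diagram rests on a law-of-cosines relation between the two standard deviations, the correlation coefficient, and the centered (bias-removed) root-mean-square difference, which is the quantity the RMSE contours show in the standard construction. The sketch below computes these statistics and checks the relation numerically. The function name and the sample values are hypothetical, and population standard deviations are used, as is conventional for this diagram.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def taylor_statistics(reference: Sequence[float],
                      test: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (sd_reference, sd_test, correlation, centered RMS difference)."""
    if len(reference) != len(test) or len(reference) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    n = len(reference)
    r_bar, t_bar = mean(reference), mean(test)
    sd_ref = math.sqrt(sum((r - r_bar) ** 2 for r in reference) / n)
    sd_test = math.sqrt(sum((t - t_bar) ** 2 for t in test) / n)
    if sd_ref == 0 or sd_test == 0:
        raise ValueError("series must not be constant")
    cov = sum((r - r_bar) * (t - t_bar) for r, t in zip(reference, test)) / n
    corr = cov / (sd_ref * sd_test)
    # Centered difference: means are removed before comparing the patterns.
    crmsd = math.sqrt(sum(((t - t_bar) - (r - r_bar)) ** 2
                          for r, t in zip(reference, test)) / n)
    return sd_ref, sd_test, corr, crmsd


# Hypothetical observed (reference) and model-simulated (test) values.
observed_field = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated_field = [1.5, 1.8, 3.4, 3.9, 5.6]
sd_ref, sd_test, corr, crmsd = taylor_statistics(observed_field, simulated_field)

# Law of cosines underlying the diagram.
identity_rhs = sd_ref ** 2 + sd_test ** 2 - 2.0 * sd_ref * sd_test * corr
print(round(crmsd ** 2, 6), round(identity_rhs, 6))
```

Because the centered difference ignores any overall bias, a model that plots close to the reference point can still over- or underestimate systematically, so the diagram is best read together with the indices in Tables 7, 8 and 9.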

Validation of developed models

For February, lower-than-mean RMSE, RRSE and MAE values were observed for ENET (14.48%, 117.11% and 12.89%), BRIDGE (14.51%, 117.32% and 12.33%), BLASSO (12.35%, 99.86% and 10.64%), SVR (10.99%, 88.89% and 9.76%), RF (7.70%, 62.24% and 6.20%) and kNN (11.18%, 90.36% and 10.10%) (Tables 7 and 8). The lowest RMSE, RRSE and MAE values were recorded for the RF model. The modified index of agreement (d) and modified Nash–Sutcliffe efficiency (E) reached maxima of 0.73 and 0.44, respectively, for the RF model (Table 9). This indicates that d and E are not sensitive to systematic over- or under-prediction, unlike other models. The coefficient of determination (R2) and correlation coefficient (r) were highest for the kNN model, i.e., 0.86 and 0.93, respectively.

For March, the minimum RMSE, RRSE and MAE were observed for SVR, i.e., 8.97%, 72.51% and 7.25%, respectively. However, R2 and r were highest for the RF model (0.91 and 0.95, respectively), whereas for SVR they were 0.83 and 0.91, respectively. The d and E values reached maxima of 0.71 and 0.34, respectively, for the SVR model, which makes this criterion not very sensitive to the quantification of systematic over- or under-prediction errors, whereas the d and E values for the RF model were considerably lower (0.57 and -0.28, respectively).

For the 15 February-15 March period, SVR (RMSE 18.91%, RRSE 152.91% and MAE 16.97%) and BLASSO (RMSE 18.78%, RRSE 151.88% and MAE 16.53%) performed relatively well compared with the other models. The d values for BLASSO and SVR were 0.49 and 0.5, respectively, and the E values reached maxima of -0.50 and -0.54, respectively. However, R2 and r were highest for the SVR model (0.97 and 0.98, respectively), followed by ANN (R2 = 0.94 and r = 0.97).

For the overall period, minor inconsistencies were observed. Lower RMSE, RRSE and MAE values were observed for SVR (12.15%, 98.22% and 10.59%), ENET (15.91%, 128.65% and 14.52%) and RF (16.16%, 130.46% and 13.71%). The corresponding d and E values were 0.46 and 0.04 for SVR, 0.48 and -0.32 for ENET, and 0.59 and -0.25 for RF. However, d was higher for RF (0.59), RIDGE (0.55) and BLASSO (0.54), while the highest E values were 0.04, -0.25 and -0.32 for SVR, RF and ENET, respectively. R2 and r were highest for the RF model (0.90 and 0.95, respectively), followed by SVR (R2 = 0.85 and r = 0.92).

Tuning parameters of the machine learning models

Tuning parameters are used in statistical modelling, particularly in shrinkage methods such as RIDGE regression, LASSO regression, and elastic net (Table 10). They control the amount of shrinkage applied to model coefficients or data values, that is, how strongly the estimates are pulled towards a central point. This central point, often the mean, represents a prior belief about the data distribution. Shrinkage helps create simpler, more interpretable models and avoids overfitting when dealing with high-dimensional data or many predictors. It makes models more stable, more robust, and better at generalizing to new data, and it is valuable for limited data and for problems with numerous predictors.

Table 10 Tuning parameters of the machine learning models
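The effect of a shrinkage tuning parameter is easiest to see with a single centered predictor, where ridge and LASSO have closed-form solutions. The sketch below assumes the penalized objectives 0.5·Σ(y − bx)² + 0.5·λ·b² for ridge and 0.5·Σ(y − bx)² + λ·|b| for LASSO; other software scales the penalty differently, so the λ values here are not comparable with those in Table 10. The function names and data are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def soft_threshold(z: float, lam: float) -> float:
    """Shrink z towards zero by lam, setting it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def one_predictor_slopes(x: Sequence[float], y: Sequence[float],
                         lam: float) -> Tuple[float, float, float]:
    """Return (OLS, ridge, LASSO) slopes for one centered predictor."""
    if lam < 0:
        raise ValueError("the tuning parameter must be non-negative")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("predictor must not be constant")
    ols = sxy / sxx                          # no shrinkage
    ridge = sxy / (sxx + lam)                # proportional shrinkage, never exactly zero
    lasso = soft_threshold(sxy, lam) / sxx   # can shrink the slope exactly to zero
    return ols, ridge, lasso


# Hypothetical data with a true slope of 2.
x_vals = [1.0, 2.0, 3.0, 4.0, 5.0]
y_vals = [2.0, 4.0, 6.0, 8.0, 10.0]
for lam in (0.0, 5.0, 25.0):
    print(lam, one_predictor_slopes(x_vals, y_vals, lam))
```

Larger values of λ pull both slopes towards zero, and LASSO drops the predictor entirely once λ exceeds the data term; elastic net combines the two penalties, so it needs a second tuning parameter to set the mix.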

Conclusion

After rigorous investigation, key findings emerged regarding the adequacy of various machine learning methods for predicting Karnal bunt over the different time periods studied. The most intriguing finding is that different models performed well in different periods. Random forest (RF) regression for February, support vector regression (SVR) for March, SVR and BLASSO for the 15 February to 15 March period, and RF for the overall period outperformed the other models. The suitability of these methods can be assessed on real-time data, and they can be used for forewarning of Karnal bunt in Punjab.