Introduction

Tomato fruit is produced and consumed worldwide, with India alone estimated to have produced over 20 million metric tons in 2022 [24], driven in part by the increased usage of processed foods. However, tomato plants suffer from a severe disease called early blight, caused by the fungus Alternaria solani, which may cause crop yield losses of \(35\%\) to \(78\%\) [18]. The disease appears as dark spots on the leaves and stems of the tomato plants. Representative images of infected plants were acquired from the Indian Agricultural Research Institute (IARI), New Delhi, India, where the tomato plants were grown and monitored to collect the real-time dataset utilized in this study; one image shows leaves infected with medium severity, and another shows leaves infected with high severity. Meteorological factors such as minimum and maximum temperatures, rainfall, relative humidity, leaf wetness, and sunshine hours significantly impact the growth of severe diseases in plants. Due to climate change and the emission of harmful gases into the atmosphere, the risk of such diseases is even higher [30]. With the world population estimated to increase by 35–50\(\%\) by 2050, there is an urgency to reduce food losses to meet global food requirements. As one of the 17 Sustainable Development Goals, preventing food losses will aid in achieving zero hunger in developing countries like India [15]. Appropriate use of fungicides may prevent disease growth, but excessive use may worsen the situation and make the disease more severe [2]. Unnecessary spraying of expensive pesticides or fungicides on healthy plants may adversely affect crop quality and lead to cost inefficiency. Since farmers often have limited knowledge in this area, they must hire experts for spray advice, which is expensive. As a result, farmers suffer substantial financial losses because of diminished crop yields.

A timely prediction of such diseases is necessary so that an appropriate amount of fungicide can be sprayed, and only on conducive days. Lately, many studies have focused on early blight prediction in tomato plants. Some of the recent studies are listed in Supplementary Table 1, which also captures the dataset types and techniques used, along with the accuracy each study achieved. It can be observed from the table that despite numerous studies being available for early blight prediction in tomatoes, most of them classify the disease from leaf images. However, after infection occurs, the disease takes some time to become visible on the leaves, usually 5–7 days [20], and only after that can it be detected through leaf images. On the other hand, the impact of weather parameters on disease growth can be analyzed at a prior stage (well before disease visibility on leaves); accordingly, fungicides can be sprayed in the early stages of the disease. This prevents the spread of the disease firsthand. Few studies have utilized the impact of weather parameters for early blight detection in tomato plants using machine learning techniques. Sudarshan et al. [27] studied the effect of six weather parameters—minimum and maximum temperature (T), morning and evening relative humidity (RH), rainfall, and rainy days—on the disease severity (DS) during 20 meteorological weeks. They used multiple linear regression analysis and concluded that a \(1\%\) increase in morning RH increases the DS by \(0.446\%\), while a \(1^\circ\)C increase in minimum T decreases the DS by \(12.606\%\). In the study [3], the authors performed feature extraction using fuzzy membership functions on a real-time weather dataset and used the Kernel Extreme Learning Machine algorithm to predict the plant as diseased or healthy. The algorithm was optimized using genetic algorithms and achieved an accuracy of \(83.33\%\) with a 70–30 train-test ratio.

Apart from weather parameters, some recent studies have also examined agricultural production practices, analyzing the effect of crop residue mulch, irrigation, and nitrogen on soil quality and on the growth and yield of the maize crop [5,6,7]. Furthermore, in [9, 10], the authors utilized sensor data for retrieving surface soil moisture and soil physicochemical properties using machine learning techniques. They further analyzed the role of soil moisture indices for early drought forecasting in several US regions [8]. In [11, 12], the authors used eddy covariance to compute evapotranspiration and crop coefficients for the paddy crop in tropical humid climates and studied the impact of paddy cultivation on the net ecosystem exchange of carbon dioxide, observing that paddy is capable of capturing \(\mathrm{{CO}}_2\) from the atmosphere.

The current study focuses on climate factors for disease detection in tomato plants, utilizing a real-time dataset of five weather parameters and, without incorporating any complex feature extraction techniques, predicting the tomato plant as healthy or diseased (early blight). Five machine learning techniques—k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), and Kernel Extreme Learning Machine (KELM)—have been analyzed for this purpose. The techniques have been optimized by tuning their hyperparameters using the optimization library Optuna. Since the dataset is imbalanced, three resampling techniques, namely Synthetic Minority Oversampling Technique (SMOTE), K-Means SMOTE (KM-SMOTE), and Support Vector Machine SMOTE (SVM-SMOTE), have been used for data balancing. In total, 20 models have been examined for performance—kNN-Imb, kNN-SM, kNN-KM, kNN-SVM, SVM-Imb, SVM-SM, SVM-KM, SVM-SVM, RF-Imb, RF-SM, RF-KM, RF-SVM, ANN-Imb, ANN-SM, ANN-KM, ANN-SVM, KELM-Imb, KELM-SM, KELM-KM, and KELM-SVM, where Imb stands for imbalanced data. Accuracy has been considered as the performance measure for all the models. The best model, outperforming all others, is then compared with previous studies. The contributions of the study are summarized as follows:

  • Utilization of a real-time dataset of five weather parameters for studying the impact on early blight disease growth in tomato plants

  • Utilization of three data balancing techniques—SMOTE, KM-SMOTE and SVM-SMOTE to achieve better performance

  • Examination of five machine learning techniques for predicting the plant as healthy or diseased

  • Hyperparameter tuning of all the five techniques using Optuna framework for optimization

  • Performance testing of the 20 models using accuracy measure

  • Analyzing the importance of hyperparameters of the algorithms for accuracy optimization

  • Comparison with the existing state of the art

  • Suggesting the best model for timely spray, thus preventing the overuse of fungicides and crop quality degradation

The remainder of the paper is organized as follows: Sect. 2 describes the dataset collection and the methods used. Section 3 presents the results, followed by a discussion. The conclusion and future directions are given in Sect. 4.

Materials and Methods

This section presents the essential aspects of this research and is divided into four subsections. The data gathering approach is explained in Sect. 2.1. The techniques used for balancing the data are described in Sect. 2.2. The machine learning algorithms and the hyperparameter optimization are discussed in Sect. 2.3. The research methodology diagram showing the course of events is presented in Sect. 2.4.

Data Gathering

For the real-time early blight data collection, tomato seedlings were planted at ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi-110012, in January 2021. After a month, the pathogen Alternaria solani was inoculated into some of the plants, and the disease became visible in April 2021. Five weather parameters were collected through sensors—Leaf Wetness (LW, between 0/dry and 15/wet), Soil Moisture (SM, in volumetric water content), Temperature (T, in \(^{\circ }\)C), Relative Humidity (RH, in %) and Dew Point (DP, in \(^{\circ }\)C). The DP was calculated [22] from T and RH using the equation

$$\begin{aligned} {\text {DP}} = T - \frac{100-{\text {RH}}}{5} \end{aligned}$$
(1)
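To illustrate Eq. (1) numerically—a simple rule-of-thumb dew point approximation, generally considered reasonable only for RH above roughly \(50\%\)—a minimal Python sketch:

```python
def dew_point(t_celsius: float, rh_percent: float) -> float:
    """Approximate dew point via Eq. (1): DP = T - (100 - RH) / 5."""
    return t_celsius - (100.0 - rh_percent) / 5.0

# Worked example: at T = 30 degC and RH = 80 %, DP = 30 - 20/5 = 26 degC
print(dew_point(30.0, 80.0))  # 26.0
```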

The sensors and software used for capturing the above parameters are listed in Supplementary Table 2. The LW sensor was connected to Port A and the SM sensor to Port D of the WatchDog data logger; the logger has built-in T, RH and DP sensors. Two data loggers were used so that parameters could be measured in both the healthy and the diseased plants. The data loggers were then connected to the computer using an interface cable, and the connection to each logger was tested with the SpecWare 9 Pro software (Preferences > Communications). More information on the software is available at https://www.specmeters.com/specware/specware-pro/.

After the first installation of the software, the Preferences must be specified. In this study, the metric option was selected, the checkbox for Dew Point was checked, device support for the 1000 series logger was added, and all other default values were retained. The Data Storage tab specifies the default location to which the logger data are downloaded. After setting the Preferences, the WatchDog Manager was opened from the toolbar to add two new 1000 series logger stations with direct connection—IARI_Diseased and IARI_Healthy. Once the stations were added, their properties were set by clicking on the Logger option in the menu bar: the data collection interval was set to 30 min, Ports A and D were enabled, and the start date and time were specified. The data were collected from March 2021 to May 2021.

To download the data collected by a logger, the OK button in the WatchDog Properties screen (beside the red text) is pressed, and the data are downloaded to the default location specified earlier in Preferences. SpecWare also allows graph analysis for a particular data logger after data collection; for example, T and DP can be plotted since the placement of the IARI_Healthy logger. A real-time weather monitor is also available in SpecWare, which displays the current weather conditions as well as the readings from externally plugged sensors. Once the data are collected, they can be viewed via the book icon in the toolbar, which displays the data in text format.

Images of the tomato plants along with the placement of sensors are shown in Supplementary Fig. 1(a) and 1(b). Radiation shields were also used to protect the sensors from rainfall or harmful radiation, as shown in Supplementary Fig. 1. The collected dataset comprised numerous duplicate records, since changes in the weather parameters are infrequent. In addition, due to noise and network issues, some of the records had missing values. Hence, the data were pre-processed: records with missing values were eliminated and, to deal with near-duplicate records, the data were averaged over 6-h intervals. The resultant dataset has 380 samples in total—250 healthy and 130 diseased—and is therefore imbalanced. The data imbalance problem, along with the resolution techniques used, is described next.
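A minimal sketch of this pre-processing step is given below, assuming each logger's export is loaded into a pandas DataFrame; the file names, column names, and CSV layout are illustrative assumptions, not the actual SpecWare export format:

```python
import pandas as pd

def preprocess(path: str, label: int) -> pd.DataFrame:
    """Load one logger's export, drop missing records, average over 6-h windows."""
    df = pd.read_csv(path, parse_dates=["timestamp"], index_col="timestamp")
    df = df.dropna()  # remove records with missing values (noise/network issues)
    # Averaging over 6-hour intervals collapses near-duplicate readings.
    df = df[["LW", "SM", "T", "RH", "DP"]].resample("6h").mean().dropna()
    df["label"] = label  # 0 = healthy, 1 = diseased (early blight)
    return df

# One file per WatchDog station, as set up in Sect. 2.1 (names are hypothetical).
data = pd.concat([preprocess("IARI_Healthy.csv", 0),
                  preprocess("IARI_Diseased.csv", 1)])
X, y = data[["LW", "SM", "T", "RH", "DP"]].values, data["label"].values
```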

Data Balancing Techniques

Training an algorithm on imbalanced data may lead to results biased toward the majority class, leading to inappropriate inference [28]. Three oversampling techniques for data balancing have been used in this study and are briefed as follows (a usage sketch is given after the list):

  • Synthetic Minority Oversampling Technique (SMOTE): This technique balances the data by generating synthetic samples for the minority class [13]. For the synthetic generation of data points, it selects a minority data point and computes its k nearest neighbors. Then, it randomly chooses one of the k neighbors and calculates the difference between the point and its neighbor. After that, a random number between 0 and 1 is generated and multiplied by the difference. The product is then added to the original data point to obtain the newly generated synthetic data point. The process is repeated until the data are balanced.

  • K-Means SMOTE (KM-SMOTE): This technique combines the k-means clustering algorithm and SMOTE [21]. First, the data are clustered into k clusters using the k-means algorithm. After that, the clusters are filtered so that only those containing large proportions of minority points remain. Each selected cluster is then assigned the number of synthetic points it needs to generate, with more points assigned to clusters in which the minority points are sparsely distributed. SMOTE is then used to generate synthetic samples in each cluster.

  • Support Vector Machine SMOTE (SVM-SMOTE): This technique is a combination of Support Vector Machines (SVM) and SMOTE [23]. First, SVM is run on the dataset, and the positive support vectors are computed. Each support vector is then assigned an equal number of synthetic samples to generate. For each support vector, its m nearest neighbors are computed, and only if more than m/2 of these neighbors are from the minority class is SMOTE used to generate synthetic samples, using its first to \(k\)th nearest neighbors.

A popular technique called Random Oversampling has not been used here, since it randomly duplicates the minority samples for balancing and may thus lead to overfitting on small datasets.
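Under the assumption that the imbalanced-learn library is used (the study does not name its implementation), the three resamplers can be applied to the feature matrix X and labels y from the pre-processing sketch as follows; the random seed is illustrative:

```python
from imblearn.over_sampling import SMOTE, KMeansSMOTE, SVMSMOTE

samplers = {
    "SM":  SMOTE(random_state=42),        # plain SMOTE interpolation
    "KM":  KMeansSMOTE(random_state=42),  # k-means clustering, then SMOTE per cluster
    "SVM": SVMSMOTE(random_state=42),     # oversampling guided by SVM support vectors
}

balanced = {}
for name, sampler in samplers.items():
    # Each balanced dataset ends up with 250 healthy and 250 diseased samples.
    balanced[name] = sampler.fit_resample(X, y)
```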

Machine Learning Classifiers and Hyperparameter Tuning

Machine learning (ML) is a subset of Artificial Intelligence used for classification or regression tasks by learning through examples [19]. In the current study, five different supervised ML classifiers have been used to classify the tomato plants as healthy or diseased. The algorithms are applied both to the imbalanced and to the three balanced datasets, resulting in 20 different models. The ML algorithms use parameters whose values can be optimized for better performance; these parameters are called the hyperparameters of the algorithm, and the process of optimizing them is known as hyperparameter tuning. The hyperparameters and the technical aspects of the classifiers are explained as follows (a combined instantiation sketch is given after Fig. 1):

  • SVM: This algorithm separates the data by finding the best hyperplane for division [16]. The chosen hyperplane is called the decision boundary, as it decides the class to which a new data point will belong. SVM selects the optimal hyperplane from the many possible ones by maximizing the margin, i.e., the distance between the hyperplane and the closest data points. The marginal lines are the lines parallel to the hyperplane on each side, one margin length away from it. The data points that lie on the marginal lines are called support vectors. The margin is maximized for better classification of points lying on either side of the hyperplane. Figure 1a represents a dataset having two features—leaf wetness and soil moisture—and two class labels—diseased (red points) and healthy (green points). The middle line represents the decision boundary DB, and the two dotted parallel lines—ML1 and ML2—are the marginal lines. The support vectors are also shown on the marginal lines.

  • kNN: This algorithm takes the training data as input, and then, for a test data point p, it calculates the distance of p to all the data points in the training set [14]. The distance metric used can be Euclidean or Manhattan. It also takes a number k as input, which decides the number of closest points or neighbors to consider while deciding on the class label. It looks at the class labels of all of p's k nearest neighbors and chooses the label attained by the most neighbors. Figure 1b shows the same dataset as in Fig. 1a. To compute the label for the point p, its \(k=3\) nearest neighbors are considered. Since it has two diseased neighbors and one healthy neighbor, the point p is classified as diseased.

  • RF: This technique uses multiple decision trees built using distinct samples drawn from the data with replacement [4]. Building every tree on the same dataset would lead to high correlation among the trees; therefore, instead of the whole set of features, a random subset of features is selected to build each tree. For the final result, the majority voting rule is applied to the set of individual tree predictions. Figure 1c shows a dataset having 4 features and n samples; k trees are built from it by taking subsets of both samples and features.

  • ANN: It can be imagined as a directed weighted graph, where the nodes represent neurons and the edges connect the nodes and carry weights [17]. It consists of an input layer, one or more hidden layers and an output layer. The input layer represents a single data point, having as many neurons as there are features in the dataset. The inputs are passed to the hidden layers, which may contain any number of neurons, for further processing. Finally, the output layer gives the predicted results of the model. Figure 1d shows a diagrammatic representation. The value at hidden neuron \(x_1\) will be computed as

    $$x_1 = f((\text{LW} \cdot w_1) + (\text{SM} \cdot w_2) + (T \cdot w_3) + (\text{RH} \cdot w_4))$$
    (2)

    where f(z) is an activation function. Considering the ReLU function,

    $$f(z) = \max(z, 0)$$
    (3)

    Similarly, the output y will be computed as

    $$y = \text{sigmoid}((x_4 \cdot w_{22}) + (x_5 \cdot w_{23}) + (x_6 \cdot w_{24}))$$
    (4)
  • KELM: This algorithm is based on neural networks with a single hidden layer [31]. Consider the diagram in Fig. 1e. Each data point has three features—LW, SM and T. There are two hidden neurons, and the output y is one-hot encoded for the two classes healthy and diseased. Then, for \(N=2\) hidden neurons and \(c=2\) classes, the output Y is given as

    $$\begin{aligned} Y(x) = \sum _{n=1}^{N} w_n \ F(a_n,b_n,x) \end{aligned}$$
    (5)

    where \(F(a_n,b_n,x) = a_n x+b_n\), \(a_n\) are weights on edges from input layer to hidden layer, \(b_n\) is the bias term and \(w_n\) are the weights on edges from hidden layer to output layer. In matrix form,

    $$\begin{aligned} F.W = Y \end{aligned}$$
    (6)

    where for m training examples \(t_1, t_2, \ldots, t_m\),

    $$F = \begin{bmatrix} F(a_1,b_1,t_1) & \cdots & F(a_N,b_N,t_1)\\ \vdots & & \vdots \\ F(a_1,b_1,t_m) & \cdots & F(a_N,b_N,t_m) \end{bmatrix}, \quad W = \begin{bmatrix} w_1^T \\ \vdots \\ w_N^T \end{bmatrix}, \quad Y = \begin{bmatrix} y_{11} & \cdots & y_{1c} \\ \vdots & & \vdots \\ y_{m1} & \cdots & y_{mc} \end{bmatrix}$$

    W can be computed using the regularized pseudo-inverse of F,

    \(W = F^T \left(\frac{I}{C} + {\text{FF}}^T\right)^{-1} Y\), where C is the regularization coefficient and I is the identity matrix. Therefore, the ELM equation can be written as

    $${\text{ELM}}(t) = F(t)\,F^{T} \left( \frac{I}{C} + {\text{FF}}^{T} \right)^{-1} Y$$
    (7)

    To prevent overfitting or getting stuck in local minima, a kernel matrix \(\delta\) can be used with the ELM, thus resulting in KELM as

    $${\text{KELM}}(t) = F(t)\,F^{T} \left( \frac{I}{C} + \delta \right)^{-1} Y$$
    (8)

    For this study, the Radial Basis Function (RBF) kernel [29] has been used, which computes the closeness of two data points and can be defined as

    $$\begin{aligned} \delta (t_1,t_2) = e^{\left( -\frac{||t_1-t_2||^2}{k}\right) } \end{aligned}$$
    (9)

    where k represents the RBF kernel parameter.
Fig. 1: ML Classifiers
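To make the five classifiers concrete, a minimal instantiation sketch is given below. The scikit-learn estimators are the library's standard ones; the KELM class is a compact reconstruction of Eqs. (5)–(9) under the assumptions noted in the comments, not the authors' exact implementation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

class KELM:
    """Kernel Extreme Learning Machine with an RBF kernel (Eqs. 7-9).
    C is the regularization coefficient, k the RBF kernel parameter.
    As in standard KELM, the kernel matrix stands in for F F^T."""
    def __init__(self, C=1.0, k=2.0):
        self.C, self.k = C, k

    def _kernel(self, A, B):
        # delta(t1, t2) = exp(-||t1 - t2||^2 / k), computed pairwise (Eq. 9)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / self.k)

    def fit(self, X, y):
        self.X_train = X
        Y = np.eye(2)[y]                 # one-hot encode the two classes
        K = self._kernel(X, X)           # kernel matrix replacing F F^T
        n = X.shape[0]
        # Output weights: (I/C + K)^(-1) Y, the regularized solution of Eq. (8)
        self.beta = np.linalg.solve(np.eye(n) / self.C + K, Y)
        return self

    def predict(self, X):
        scores = self._kernel(X, self.X_train) @ self.beta
        return scores.argmax(axis=1)

classifiers = {
    "kNN":  KNeighborsClassifier(),
    "SVM":  SVC(),
    "RF":   RandomForestClassifier(),
    "ANN":  MLPClassifier(max_iter=1000),
    "KELM": KELM(),
}
```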

The Python library Optuna [1] has been used for the hyperparameter tuning. It requires an objective function to be defined, which can be maximized or minimized. The parameters to optimize are defined within the objective, and a Bayesian optimization algorithm is used for the search. The parameters that have been optimized for each algorithm, together with the ranges of values they can take, are shown in Supplementary Table 3.
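As a minimal sketch of this workflow (using RF as the example; the parameter names follow scikit-learn, but the search ranges are illustrative, not the exact values of Supplementary Table 3):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Xb, yb = balanced["KM"]  # a balanced dataset from the resampling sketch above
X_train, X_test, y_train, y_test = train_test_split(Xb, yb, test_size=0.3)

def objective(trial):
    # Search space declared inside the objective, as Optuna requires.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 10, 200),
        max_depth=trial.suggest_int("max_depth", 2, 100),
        criterion=trial.suggest_categorical("criterion", ["gini", "entropy"]),
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")  # TPE (Bayesian) sampler by default
study.optimize(objective, n_trials=100)            # 100 iterations, as in Sect. 2.4
print(study.best_params, study.best_value)
```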

Research Methodology

The methodology diagram for the approach followed in this study is shown in Supplementary Fig. 2. The imbalanced dataset TomEBD has been balanced using three techniques—SMOTE, KM-SMOTE and SVM-SMOTE. After oversampling the minority class, there are 250 data points for each of the healthy and diseased classes, resulting in a total of 500 data points.

After resampling, four types of datasets are available—Imbalanced, SMOTE-Balanced, KM-SMOTE-Balanced and SVM-SMOTE-Balanced. All four datasets are provided as input for training and testing the five ML algorithms. The train-test ratio has been set to the standard 70-30. The classification results of all five algorithms on all four datasets have been optimized by tuning their hyperparameters with the Optuna framework. The hyperparameters are passed to the framework along with the performance measure to be optimized—accuracy in this case. Accuracy represents the fraction of correctly classified data samples out of the total number of samples [26]. The formula is given as:

$${\text{Accuracy}} = \frac{{\text{TH}} + {\text{TD}}}{{\text{TH}} + {\text{TD}} + {\text{FH}} + {\text{FD}}}$$
(10)

Here, TH represents healthy samples that were truly classified as healthy, and FD represents healthy samples that were falsely classified as diseased. A similar explanation applies to True Diseased (TD) and False Healthy (FH). The optimization has been carried out over 100 iterations to find the parameter values that maximize accuracy. The parameters that have been optimized have already been listed in Supplementary Table 3; default values have been used for all remaining parameters. The procedure for hyperparameter optimization of the KELM algorithm is shown in Algorithm 1, where X represents the balanced input data and Y represents the output labels. A similar approach is followed for all the other algorithms.

Algorithm 1: Hyperparameter Optimization
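A hedged Python reconstruction of Algorithm 1 follows, reusing the KELM class sketched earlier; the search ranges are placeholders for those in Supplementary Table 3:

```python
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def kelm_objective(trial):
    # X, Y: balanced input data and output labels, split 70-30 as in the study.
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
    model = KELM(
        C=trial.suggest_float("C", 1e-2, 1e2, log=True),  # regularization coefficient
        k=trial.suggest_float("k", 1e-2, 1e2, log=True),  # RBF kernel parameter
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

kelm_study = optuna.create_study(direction="maximize")
kelm_study.optimize(kelm_objective, n_trials=100)  # 100 optimization iterations
```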

The resulting 20 models—kNN-Imb, kNN-SM, kNN-KM, kNN-SVM, SVM-Imb, SVM-SM, SVM-KM, SVM-SVM, RF-Imb, RF-SM, RF-KM, RF-SVM, ANN-Imb, ANN-SM, ANN-KM, ANN-SVM, KELM-Imb, KELM-SM, KELM-KM and KELM-SVM—are then analyzed based on their mean accuracy values for the selection of the best model for class prediction. For stable results, the algorithms have been executed over 10 iterations, and the mean accuracy over all the iterations has been taken as the final performance result. The model with the highest mean accuracy is then selected as the best.
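A sketch of this selection procedure, assuming the classifiers and balanced dictionaries from the earlier sketches (tuned hyperparameters and the imbalanced dataset, which would be handled analogously, are omitted for brevity):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

results = {}
for clf_name, clf in classifiers.items():
    for data_name, (Xb, yb) in balanced.items():
        accs = []
        for _ in range(10):  # 10 iterations for stable results
            X_tr, X_te, y_tr, y_te = train_test_split(Xb, yb, test_size=0.3)
            clf.fit(X_tr, y_tr)
            accs.append(accuracy_score(y_te, clf.predict(X_te)))
        results[f"{clf_name}-{data_name}"] = np.mean(accs)

best_model = max(results, key=results.get)  # KELM-KM in this study
```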

The scikit-learn library [25] in Python has been used for the implementation, and the results are plotted using the Matplotlib library. The performance measure values over the 100 optimization iterations and the importance of the hyperparameters in the optimization are also shown in graphs plotted with the Optuna framework. The implementations were executed on macOS Big Sur Version 11.3.1 with 8 GB RAM.

Results and Discussion

This section is divided into four subsections. The first (Sect. 3.1) discusses the performance of the 20 models evaluated on the TomEBD dataset using the mean accuracy measure and shows the optimization history of the algorithms over 100 iterations. The second (Sect. 3.2) shows the optimized values of the hyperparameters for all five algorithms and presents the importance of the hyperparameters in accuracy maximization. The third (Sect. 3.3) compares the current study with previous studies on early blight prediction in tomatoes using weather parameters. The fourth (Sect. 3.4) presents a discussion of the entire study.

Performance Results

This subsection presents the classification results for all the 20 models tested—kNN-Imb, kNN-SM, kNN-KM, kNN-SVM, SVM-Imb, SVM-SM, SVM-KM, SVM-SVM, RF-Imb, RF-SM, RF-KM, RF-SVM, ANN-Imb, ANN-SM, ANN-KM, ANN-SVM, KELM-Imb, KELM-SM, KELM-KM and KELM-SVM. The models are compared based on their performance using the accuracy measure. The results are shown in Supplementary Fig. 3. In the figure, the blue colored bars represent imbalanced models, the pink colored bars represent the SMOTE balanced models, the green colored bars represent the KM-SMOTE balanced models, and the gray colored bars represent the SVM-SMOTE balanced models. The following observations have been made:

  • Considering the imbalanced models, kNN-Imb performed best with 79.82% mean accuracy. However, all the balanced models performed far better than the imbalanced models.

  • The models that were balanced using the KM-SMOTE technique performed better than those that were balanced using either SMOTE or SVM-SMOTE techniques in all the cases.

  • All the SVM-SMOTE balanced models performed better than the SMOTE balanced models, except in the case of kNN, where kNN-SM performed better than kNN-SVM with a mean accuracy of \(84\%\).

  • ANN models performed poorly compared to the others, since the dataset is small while neural networks are known to perform well on large datasets.

  • SVM models also performed poorly (other than SVM-KM) compared to other models. One reason is that SVM fails on imbalanced data due to skewness of the hyperplane toward the minority class and the imbalance in the number of support vectors. Secondly, the SMOTE and SVM-SMOTE balanced datasets may have introduced noisy samples, causing features such as T and RH to have similar or overlapping value ranges and, consequently, overlapping labels. The consequence is the poor performance of the models SVM-SM and SVM-SVM.

  • Among all the 20 models analyzed, KELM-KM performed best with the highest mean accuracy of \({\textbf {85.82\%}}\).

The objective values, or mean accuracy values, of the best performing model for each ML classifier—kNN-KM, SVM-KM, RF-KM, ANN-KM and KELM-KM—over 100 iterations are shown in Supplementary Fig. 4. It can be observed that SVM-KM and ANN-KM reach their optimal accuracy in early iterations, while the others achieve it in later iterations.

Optimized Hyperparameters and their Importance

This subsection shows the optimized values of the hyperparameters and discusses their importance in accuracy maximization. The optimized hyperparameter values are shown in Supplementary Table 4 for the models based on kNN and SVM, Supplementary Table 5 for the models based on RF, and Supplementary Table 6 for the models based on ANN and KELM. The following observations can be made:

  • It can be observed from Supplementary Table 4 that among the four SVM models, SVM-KM performed best with a linear kernel (a polynomial of degree 1), indicating that it separated the data using a straight line as the decision boundary. All four kNN models used distance as the weight metric, and the majority used Manhattan as the distance measure. The best of the four, kNN-KM, considered 4 nearest neighbors for predicting the class labels.

  • Considering the RF models in Supplementary Table 5, all four models used entropy as the split criterion. The best performer among these models, RF-KM, used the maximum number of estimators or decision trees, a moderate \(max\_depth\) value of 50, and the square root of the total number of features as \(max\_features\).

  • Considering the best performing ANN model in Supplementary Table 6, ANN-KM used a moderate \(learning\_rate\) among the four ANN models, the Adam optimizer, and only 1 hidden layer with 261 hidden neurons.

  • Among the four KELM models in Supplementary Table 6, the best performing model, KELM-KM, used the minimum value of the regularization coefficient C compared to all others, indicating less overfitting of the data. The model used a small value, k=2, for the RBF kernel parameter.

The role that the hyperparameters played in the optimization of the objective is represented through their importance for all five best performing models in Fig. 2. The following observations have been made from the figure:

  • The parameters that most affect the optimization of kNN are the weights (\(78\%\)) and the number of neighbors \(n\_{\text{neighbors}}\) (\(22\%\)). The distance metric hardly plays any role in the optimization. This can be seen in Fig. 2a.

  • The optimization of SVM depends almost entirely on the chosen kernel (\(99\%\)), with only a slight dependence on the regularization coefficient C and the degree of the polynomial (\(< 1\%\)). This can be seen in Fig. 2b.

  • For the KELM optimization, both the parameters C and k play vital roles. However, the kernel parameter k appears much more important (\(81\%\)) than the regularization coefficient C (\(19\%\)) for accuracy maximization on this dataset. This can be seen in Fig. 2c.

  • RF is most affected by the number of decision trees chosen, \(n\_{\text{estimators}}\) (\(59\%\)), followed by the maximum depth of the trees, \({\text{max}}\_{\text{depth}}\) (\(30\%\)), and the maximum features assigned to each tree, \({\text{max}}\_{\text{features}}\) (\(10\%\)). The splitting criterion hardly plays any role in the optimization. This can be seen in Fig. 2d.

  • The learning rate plays the most significant role in the optimization of ANN (\(58\%\)), followed by the number of units in each layer (\(33\%\)). The choice of optimizer and the number of layers are also important (\(10\%\) combined). Figure 2e shows the same.

Fig. 2: Optimized Hyperparameter Importance
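These importance values and the optimization histories can be produced directly with Optuna's analysis utilities; a minimal sketch, assuming the study object from the KELM tuning sketch in Sect. 2.3:

```python
import optuna

# Fraction of the objective's variation attributed to each hyperparameter
print(optuna.importance.get_param_importances(kelm_study))
# e.g., {'k': 0.81, 'C': 0.19} for the KELM-KM model reported above

# Optuna can also render plots like Fig. 2 and Supplementary Fig. 4 directly
optuna.visualization.plot_param_importances(kelm_study).show()
optuna.visualization.plot_optimization_history(kelm_study).show()
```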

Comparison with Existing Studies

This subsection compares the results of the current study with previous studies that predicted early blight in tomatoes based on weather parameters. In the study [27], the authors used regression analysis to form rules based on the weather parameters and disease severity; hence, a direct comparison cannot be made. However, in the study [3], the authors worked on the same weather dataset to classify the plant as diseased or healthy by incorporating feature extraction (FFFT). They balanced the dataset using Random Oversampling (ROS), Importance Sampling (IMPST) and SMOTE, and utilized KELM optimized with genetic algorithms (OKELM) for the classification task. They presented results for five different train-test ratios. Considering the results for the 70-30 train-test ratio, the comparison of [3] with the current study is shown in Supplementary Table 7.

It can be observed from Supplementary Table 7 that four of the proposed KM balanced models performed better than all four approaches proposed in [3], without using any complex feature extraction techniques. Moreover, the proposed model KELM-KM performs best of all, with 85.82% accuracy.

Discussion

All the above observations indicate that KELM-KM is the leading model, with the highest mean accuracy value of 85.82%. Moreover, all the KM-SMOTE balanced models performed better than the imbalanced, SMOTE balanced and SVM-SMOTE balanced models. This is because KM-SMOTE targets the generation of artificial samples in the areas where it is most effective—the clusters with more minority samples—which prevents the generation of noisy samples. Moreover, KELM is a powerful algorithm with robustness toward noisy data and a regularization parameter for dealing with the problem of overfitting. With the optimization of the regularization parameter and the kernel parameter, it performs exceptionally well on the KM-SMOTE balanced dataset. A non-iterative implementation of the algorithm also ensures short running times on large datasets. The employment of the Optuna framework for hyperparameter tuning had a significant impact on the performance of all the learning models. The proposed model even outperforms the existing state of the art, which involves complex feature extraction techniques and optimization.

The proposed study takes into account the weather parameters for disease prediction in plants. However, certain other factors—soil contents such as moisture, pH and nutrients, the presence of various gases in the atmosphere, color properties, stomatal resistance, and turgor pressure—can also be leveraged along with the weather parameters for better and earlier prediction of such harmful diseases in agricultural plants. Further, different cultivar types in various regions can also be considered, since climatic conditions vary substantially across regions. Also, the study only focuses on predicting the early blight disease without estimating disease severity, which is essential for identifying diseases in their early stages.

Disease forecasting ML models are often susceptible to uncertainties. The data collected through IoT sensors may be affected by noise, since the sensors are connected to a network, and noise in the data may lead to incorrect predictions. However, several noise handling algorithms are available and can be used for data filtering and cleaning. Furthermore, ML models have hyperparameters that need to be tuned for better predictions; simple guesses of hyperparameter values may lead to below-par results, and hyperparameter optimization algorithms can be utilized to address this. External unforeseen factors may also influence model performance; for example, on days with excessive rainfall, the model may not be able to predict the conduciveness of a disease in real time. The models should be trained with as much available knowledge as possible to improve performance in unforeseen situations.

Conclusions

In the current research study, the authors have emphasized predicting early blight disease in tomato plants based on meteorological factors—temperature, relative humidity, soil moisture, leaf wetness and dew point. For the data balancing, three resampling techniques have been used—SMOTE, KM-SMOTE and SVM-SMOTE—and five classifiers, namely kNN, SVM, RF, ANN and KELM, have been tested for performance on the dataset—resulting in 20 models. The hyperparameters of all five algorithms have been optimized for accuracy maximization using the Python framework Optuna. The results indicate that the KELM classifier used on the KM-SMOTE balanced dataset outperforms all other classifiers with a mean accuracy value of 85.82%. It has also been observed that all five classifiers performed poorly on the imbalanced dataset in comparison with the balanced datasets, and the models utilizing the KM-SMOTE balanced dataset performed much better than the models utilizing datasets balanced with SMOTE or SVM-SMOTE. A comparison with existing state-of-the-art techniques has also been drawn. It has been revealed that the models kNN-KM, SVM-KM, RF-KM and KELM-KM outperformed the existing, more complex methods used for disease prediction on the same dataset. The optimization history has also been presented, along with the importance of the hyperparameters in objective maximization. Hence, the model KELM-KM can be used in real-time applications to predict a tomato plant as diseased or healthy. Accordingly, the use of fungicide spray may be reduced by spraying only on plants predicted to be diseased. This leads to a cost-efficient model that prevents the degradation of tomato crops through excessive use of fungicides, even on healthy plants.

Future research can explore feature extraction techniques that may yield better performance. Various deep learning models can also be analyzed for the same purpose. A real-time mobile application could also be developed to alert farmers on a regular basis if the plants are found to be diseased. Future work may also focus on predicting early blight disease severity based on weather parameters, so that appropriate measures can be taken while the disease severity is still low. Finally, other factors such as soil and leaf properties and the amounts of atmospheric gases can also be employed for better prediction results.