Introduction

Crop production forecasting is becoming more important as concerns over food security grow. Early crop yield prediction plays a significant role in averting famine by anticipating the amount of food available for the expanding global population. Increasing crop yield is a workable answer to one of the world's most serious problems: hunger [1]. According to a World Health Organization (WHO) analysis, 820 million people worldwide still do not have enough food. The United Nations' Sustainable Development Goals (UNSDGs) set the objective of ending hunger, achieving food security, and supporting reliable and sustainable agriculture by 2030. According to the Food and Agriculture Organization (FAO), the world's population of 9.3 billion in 2050 will require 60% more food than is currently produced [2, 3]. Crop production prediction can provide essential information for attaining this goal. However, many factors influence crop productivity, which makes it difficult to build a consistent forecasting model with traditional procedures. Yield prediction therefore needs to draw on advances in computing for model design and training. A further factor is that 51% of farmers wish to relocate to cities rather than pass their farms on to the next generation [4]. Finally, farmers frequently choose the wrong crops, for example crops that are unsuited to the soil or that are planted during the stormy season. The decision may be made without prior knowledge, for instance because the farmer bought the land from someone else. Choosing an unsuitable crop reduces yield, and a family that depends entirely on this revenue then struggles to subsist.
Machine learning is a significant resource for deciding growing-season activities, the crop types to be sown, and agricultural yield predictions [5]. Many approaches use machine learning to improve the predictions of crop recommendation systems. Supervised machine learning techniques are already employed for industrial purposes, from analyzing store patron behavior to forecasting phone usage, and they are applied in agriculture as well [6]. Crop forecasting nevertheless remains one of agriculture's most crucial and formidable problems, and numerous models have been designed and tested to date. The problem demands many datasets because of the complexity of the factors that affect crop yield, including atmospheric conditions, fertilizer type, soil, and seed. Machine learning (ML) and data mining tools help farmers choose a crop by taking into account factors such as soil qualities, geographic region, sowing season, and ecological considerations. Estimating agricultural productivity therefore involves many intricate steps and is not a straightforward operation [7].

Figure 1 illustrates the block representation of the recommendation system. The statistical characteristics of these algorithms can result in a sizable boost in yield. A high degree of precision is required, because inaccurate recommendations have damaging consequences, including seed and productivity losses and a substantial decline in profitability. Current crop yield prediction strategies can estimate the actual yield reasonably well, but better performance is still preferred [8]. Researchers also face obstacles in conducting country-level case studies because accurate and up-to-date information is difficult to access.

Fig. 1 Block representation of the recommendation system

The main contribution of this research is an intelligent crop recommendation system for better yield prediction. To recommend the crop best suited to a given land site, it employs three supervised machine learning models: decision tree, random forest, and KNN. The proposed methods are then compared on their performance, using two split criteria, entropy and the Gini index.

Literature Survey

Given the importance of this field, several approaches have been proposed to increase crop prediction accuracy. A feed-forward back propagation (FFBP) artificial neural network (ANN) was employed to model and forecast crop yields in rural regions from soil factors (potassium, pH, and nitrogen) combined with environmental factors (rainfall, humidity, etc.) [9]. Another analysis considered many variables, including rainfall, groundwater, the area under cultivation, and the kind of soil, to maximize crop productivity. That study examined fuzzy logic, SVM, and KNN, as well as modified KNN (MKNN), for classification, and employed the K-means technique for clustering [10]. MKNN gave the best prediction quality among the algorithms examined. Other work considered the feasibility of developing a tool for cultivators that addresses challenges the industrialized world currently faces [11]. Farmers can run such a tool themselves and try out different scenarios by entering information such as the location and time (years) for the chosen crop [12]. The user can select a procedure that evaluates the outcome for the given input so that the system can make a recommendation. The crop yield rate appears in the outputs, which are derived from historical data converted into a consistent format. Naive Bayes and KNN are the machine learning models used [13].

Crop data spanning the last decade were collected for the dataset from various sources, including official sites. Soil sensors and DHT11 sensors are used to measure humidity and temperature. For gathering atmospheric data, an Arduino Uno board with an ATmega microcontroller can be used in an Internet of Things (IoT) setup.

In the work of Lekhaa et al. [14] and Lobell and Burke [15], boosting, which combines weak rules iteratively to deliver higher accuracy, is used to enhance the Naive Bayes (NB) strategy, a supervised learning procedure; a model accuracy of about 96% was reported [16]. Stacking was used in regression analysis to improve on the estimates and forecasts of three regression techniques. The suggested approach was compared with random forest and Naive Bayes for performance analysis [17]. Nevertheless, its accuracy lagged that of the random forest, based on the probability function built into the system. Crop production has been forecast using several machine learning approaches, with accuracy and error rates compared for specific areas [18]. Using available resources, a system was introduced to address this challenge through predictions of crop sustainability. This system also offers recommendations based on the established ML models, considering crucial environmental and economic factors [19, 20].

Table 1 Existing crop prediction approaches

Table 1 lists the existing crop prediction approaches. Using a dataset covering 22 crop types, the study compares supervised learning methods that include decision tree, random forest, and KNN. The Gini index and entropy are the criteria employed for the random forest and decision tree classifiers [10]. Based on precision, recall, and F1 score, the results demonstrate that the suggested machine learning technique matches the best reported accuracy [5, 26]. Several machine learning algorithms, including KNN, support vector machine (SVM), fuzzy logic (FL), artificial neural network (ANN), multiple linear regression (MLR), and random forest (RF), have been deployed on the provided datasets and evaluated for yield prediction accuracy. The accuracies of the algorithms are compared [11] by Sunil et al. [19] and Van Klompenburg et al. [27]. The results demonstrate that the random forest classifier, with a 99.40% accuracy rate, is the most accurate of the conventional strategies deployed on the datasets provided.

Datasets and Methodology

Data collection gathers and filters data from various sources. First, records of past occurrences are retained so that recurring patterns can be detected during data analysis. Then, in data pre-processing, the raw data are transformed into a form on which ML techniques can operate and provide better insights. The dataset is subsequently split, the model is fitted on the training portion and checked on the held-out portion, and the classification report and confusion matrix are generated from the model's predictions.
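The splitting step can be illustrated with a minimal, standard-library sketch. The function name `train_test_split`, the 80/20 ratio, the seed, and the placeholder rows are assumptions made for illustration; the study does not state its split ratio or tooling.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists.

    `test_fraction` and `seed` are illustrative defaults, not values
    reported in the study.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows of the form (sample_id, label); not real dataset values.
rows = [(i, "crop_a" if i % 2 == 0 else "crop_b") for i in range(10)]
train, test = train_test_split(rows, test_fraction=0.2)
```

The model would then be fitted on `train` only, with `test` reserved for the classification report and confusion matrix.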

Table 2 shows a sample of the dataset fed to the methodology for executing the recommendation process. The data are taken from the "crop recommendation" dataset on the Kaggle website. The dataset covers 22 crop types, which serve as the labels, and its features are nitrogen content ratio (N), phosphorus content ratio (P), potassium content ratio (K), temperature (°C), relative humidity (%), pH value, and rainfall (mm).

Table 2 Sample dataset

Figure 2 depicts the steps of the methodology. The confusion matrix tabulates how often the actual and predicted labels agree or disagree, assigning each prediction to one of four outcomes: true positive, true negative, false positive, or false negative. The classification report summarizes the statistical assessment of the classifier's predictions, and it provides three performance metrics. Precision is the classifier's ability to avoid labeling a negative sample as positive; for each class, it is the ratio of true positives to the sum of true and false positives. Recall is the classifier's ability to find all positive samples; for each class, it is the ratio of true positives to the sum of true positives and false negatives. The F1 score, with 0.0 being the poorest and 1.0 the best, is the harmonic mean of precision and recall. F1 scores are frequently lower than accuracy values because both precision and recall enter the computation. Finally, model accuracy is the ratio of correct predictions to the total number of predictions. Figure 3 shows the confusion matrix.

$${\text{Precision}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}$$
(1)
$${\text{Recall}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}$$
(2)
$$F1 \, {\text{score}}=\frac{2\times {\text{Precision}}\times {\text{Recall}}}{({\text{Precision}}+{\text{Recall}})}$$
(3)
$${\text{Accuracy}}=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{TN}}+{\text{FP}}+{\text{FN}}}$$
(4)

Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
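As a worked illustration of Eqs. (1) to (4), the following sketch computes the four metrics from raw counts, taking the F1 score as the harmonic mean of precision and recall. The function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return precision, recall, F1 score, and accuracy from raw counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen here, not taken from the study).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for a single class.
metrics = classification_metrics(tp=8, tn=10, fp=2, fn=0)
```

With these counts, precision is 8/10, recall is 8/8, and accuracy is 18/20, so the F1 score falls between precision and recall.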

Fig. 2 Methodology steps

Fig. 3 Confusion matrix
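For a multi-class problem such as crop recommendation, a confusion matrix can be tabulated from the actual and predicted labels, and the per-class TP, TN, FP, and FN counts can be read off in a one-vs-rest fashion. The sketch below assumes this one-vs-rest reading; the helper names and the six example labels are illustrative only.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual class, columns = predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def per_class_counts(labels, matrix, target):
    """One-vs-rest (tp, tn, fp, fn) for the class `target`."""
    k = labels.index(target)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual target, predicted other
    fp = sum(row[k] for row in matrix) - tp   # actual other, predicted target
    tn = sum(map(sum, matrix)) - tp - fn - fp
    return tp, tn, fp, fn


# Illustrative labels only; not predictions reported in the study.
actual = ["rice", "rice", "maize", "maize", "jute", "jute"]
predicted = ["rice", "jute", "maize", "maize", "jute", "rice"]
labels, matrix = confusion_matrix(actual, predicted)
rice_counts = per_class_counts(labels, matrix, "rice")
```

The diagonal of `matrix` holds the correct predictions, so accuracy is the diagonal sum divided by the total number of samples.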

Proposed Study

K-nearest Neighbor Classifier

KNN is a straightforward supervised machine learning method [2]. Figure 4 describes the practical steps for KNN. The method has two notable features. First, it is non-parametric: no distribution parameters need to be determined, because the model depends directly on the dataset. Second, it simply stores the training data, so no learning takes place during training. As a consequence, classification is deferred to the testing phase, which requires a lot of memory and increases testing time.

Fig. 4 KNN classifier steps

$$\mathrm{Euclidean } \, {\text{distance}}=d\left(b,a\right)=\sqrt{\sum_{i=1}^{n}{({b}_{i}-{a}_{i})}^{2}}$$
(5)

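Here a and b are two samples with n features each. KNN does its work when a test case is presented to the system, classifying it by the labels of its nearest training samples under the Euclidean distance of Eq. (5). For this reason, it is termed a lazy learner.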
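A minimal sketch of this procedure, assuming majority voting among the K nearest neighbors under the Euclidean distance of Eq. (5), is given below. The function names and the two-feature toy points are assumptions for illustration; the study's dataset has seven features.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Eq. (5): square root of the summed squared feature differences."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Lazy learning: all distance computation happens at prediction time.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data; values are illustrative, not from the dataset.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
train_labels = ["crop_a", "crop_a", "crop_a", "crop_b", "crop_b", "crop_b"]
prediction = knn_predict(train_points, train_labels, (1.2, 1.4), k=5)
```

Because every training point is kept and ranked for each query, memory use and prediction time grow with the size of the training set, which is the cost of lazy learning noted above.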

Decision Tree Classifier

The decision tree is another supervised learning method, which can be deployed for both classification and regression; each path through the tree corresponds to a sequence of decisions leading to a class [6, 12]. At each node, the instance is compared against the node's test, and the outcome of the split determines the jump to the next node. Splitting continues, producing sub-trees, until a leaf node is reached, at which point the class label is determined. The procedure is therefore a recursive partitioning. The stepwise pictorial representation of the decision tree strategy is shown in Fig. 5.

Fig. 5 Decision tree strategy
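The traversal from the root to a leaf can be sketched with a small hand-built tree. The `Node` and `Leaf` types, the feature names, and the thresholds are hypothetical and serve only to show how an instance is routed; they are not the tree learned in the study.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned when this leaf is reached


@dataclass
class Node:
    feature: str      # feature tested at this node
    threshold: float  # split point for the test
    left: Union["Node", Leaf]   # followed when value <= threshold
    right: Union["Node", Leaf]  # followed when value > threshold


def predict(tree, sample):
    """Walk from the root to a leaf and return that leaf's class label."""
    node = tree
    while isinstance(node, Node):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hypothetical tree; thresholds are made up for illustration.
tree = Node(
    feature="rainfall",
    threshold=150.0,
    left=Node("humidity", 60.0, Leaf("crop_a"), Leaf("crop_b")),
    right=Leaf("crop_c"),
)
label = predict(tree, {"rainfall": 100.0, "humidity": 80.0})
```

In a trained tree, the feature and threshold at each node would be chosen by the split criterion (entropy or Gini index) rather than set by hand.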

Entropy, the Gini index, and information gain are determined for every split. Information gain measures the amount of information acquired by splitting on an attribute, that is, the reduction in entropy. Entropy indicates the degree of impurity within the given attribute, and the Gini index is obtained by subtracting the sum of the squared class probabilities from 1.

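$$I\left({\text{Attribute}}\right)=\sum_{i}\frac{{y}_{i}+{n}_{i}}{y+n}\,{\text{Entropy}}\left({S}_{i}\right)$$
(6)

Here I(Attribute) is the expected information (weighted entropy) of the attribute after the split, \({y}_{i}\) and \({n}_{i}\) are the numbers of yes and no samples for the i-th value of the attribute, \({S}_{i}\) is the corresponding subset of samples, and y and n are the yes and no counts over the whole sample.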

$$\mathrm{Information } \, {\text{Gain}}={\text{Entropy}} \left(S\right)-I({\text{Attribute}})$$
(7)
$${\text{Entropy}}\left(S\right)=-H({\text{yes}})*{{\text{log}}}_{2}H\left({\text{yes}}\right)-H\left({\text{no}}\right)*{{\text{log}}}_{2}H({\text{no}})$$
(8)
$${\text{Entropy}}\left(S\right)=\sum_{i=1}^{c}-{H}_{i}{{\text{log}}}_{2}{H}_{i}$$
(9)
$$\mathrm{Gini } \, {\text{Index}}=1-\sum_{i=1}^{c}({H}_{i}^{2})$$
(10)

Here S is the sample, H(yes) and H(no) are the probabilities of the yes and no classes, \({H}_{i}\) is the probability of class i, and c is the number of classes.
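The split measures can be sketched as follows, taking I(Attribute) to be the size-weighted entropy of the subsets produced by a split (the usual ID3 reading, assumed here). Function names and the yes/no example are illustrative.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Sum over classes of -p * log2(p); 0 for a pure or empty set."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini_index(labels):
    """One minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Illustrative binary sample: five yes and five no.
parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
partial_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain_perfect = information_gain(parent, perfect_split)
gain_partial = information_gain(parent, partial_split)
```

A balanced binary sample has the maximum entropy of 1 bit and a Gini index of 0.5; a split that separates the classes completely recovers the full bit as information gain, whereas a partial separation recovers only part of it.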

Random Forest Classifier

This approach combines multiple decision tree (DT) classifiers to improve performance. It is based on ensemble learning [7, 21]. The decision trees are built on random subsets of the training data, and each predicts an outcome. The final prediction is made through majority voting.

Figure 6 demonstrates the random forest classifier strategy applied in the recommendation system. At the final step, each estimator gives its prediction, and the category with the most votes is assigned.

Fig. 6 Random forest classifier strategy
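The two ingredients described above, random resampling of the training data and majority voting across estimators, can be sketched as follows. The three lambda "trees" are hypothetical stand-ins for trained decision trees, and bootstrap sampling with replacement is an assumption about how the random subsets are drawn.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one random training subset)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most frequent prediction."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(estimators, sample):
    """Ask every estimator for a prediction and return the winning class."""
    return majority_vote([estimator(sample) for estimator in estimators])


# Stand-in estimators with made-up thresholds, one per "tree".
estimators = [
    lambda s: "crop_a" if s["rainfall"] <= 150 else "crop_b",
    lambda s: "crop_a" if s["humidity"] <= 60 else "crop_b",
    lambda s: "crop_a" if s["temperature"] <= 25 else "crop_b",
]
sample = {"rainfall": 100, "humidity": 80, "temperature": 20}
prediction = forest_predict(estimators, sample)

rows = list(range(10))
subset = bootstrap_sample(rows, random.Random(0))
```

Here two of the three stand-in estimators vote for the same class, so that class is assigned even though one estimator disagrees.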

Results and Discussions

In the KNN model, the accuracy depends on the number of neighbors; here, K = 5. In the figure, points on the straight line represent accurate predictions, whereas points that project away from the line represent poor predictions.

Figure 7 shows the analysis for the prediction of crop recommendation through the KNN approach. The tree-based classifiers are then evaluated under the two selected criteria, Gini and entropy, on the samples fed to the system.

Fig. 7 KNN prediction

Figure 8 presents the results for the two criteria chosen for the classifier in the recommendation system. The decision tree with the Gini criterion showed fewer outlying projections than the decision tree with the entropy criterion. Thus, Gini outperforms entropy in accuracy.

Fig. 8 Analysis of Gini and entropy criterion's predictions

Figure 9 illustrates the decision tree's predictions compared with the actual values under the criteria applied in the system. For the random forest, the estimator count is set to 100, i.e., 100 decision trees. Changing the estimator count did not affect the model accuracy, so this setting achieves the best outcome. Random forest produced equivalent predictions under both criteria.

Fig. 9 Decision tree prediction

Figure 10 shows the random forest classifier's predictions compared with the actual values under the criteria applied in the system.

Fig. 10 Random forest prediction

Figure 11 compares the model accuracy of the proposed methods. Table 3 tabulates the model accuracy and training scores of the proposed methods for the crop recommendation system.

Fig. 11 Model accuracy of proposed methods

Table 3 Model accuracy and training scores comparison for the proposed methods

Conclusion

The study introduces an efficient crop recommendation system using classifier models. To recommend the crop best suited to a given land site, it employs three supervised ML models, decision tree, random forest, and K-nearest neighbor (KNN), and compares their performance. The tree-based models are evaluated under two criteria, entropy and the Gini index. The crop prediction results show that the random forest classifier outperformed the other strategies under both the Gini and entropy criteria, with an accuracy of 99.04%. In contrast, K-nearest neighbor attained the lowest accuracy (97.06%), with the decision tree classifier in between (~98%). Among the decision tree variants, the Gini criterion outperformed the entropy criterion, with an accuracy of 98.85%.