Introduction

According to [1], agriculture in developing countries is the essential source of many of the products and foods on which a multitude of organisms depend to sustain and improve their lives. With traditional methods, agriculture's share of freshwater use reaches 85%, so managing limited freshwater is a challenging issue in agricultural activities, specifically when water needs escalate at a fast pace. In African countries, for example, rapid population growth lowers water availability, and most of these countries lie in semi-arid zones characterized by unstable rainfall and long drought periods, which reflects negatively on crop yield. Accordingly, irrigation becomes very challenging for the farmer because of water resource depletion.

Different irrigation techniques have therefore been employed to minimize the use of freshwater resources in agriculture: flood irrigation, which relies on covering the field with water; drip irrigation, which reduces water consumption considerably by supplying water directly to the root zone or to the soil surface through valves, pressure lines, and emitters; and sprinkler irrigation, which applies water in a manner similar to natural rainfall by pumping it through a system of pipes. Nevertheless, more powerful solutions are needed to overcome this limitation. For this reason, researchers have carried out many studies to devise novel water resource management systems that benefit from recent advances in the Internet of Things (IoT), big data, and machine learning (ML). For instance, to support decision-making in drip irrigation, a smart system relying on soil image processing, field sensing, and mobile technologies was proposed in [2] to improve irrigation planning by determining the amount of water required for plant cultivation. Likewise, the authors of [3] proposed an automatic irrigation system that employs a GPRS module and a wireless sensor network to save up to 90% of water in comparison with traditional techniques. Furthermore, the most difficult task for researchers who use machine learning appears to be the selection of the best model, data samples, and corresponding data sets. To address these issues, different machine learning algorithms and training models must be tested with various data samples to determine the one that predicts most accurately. This is the approach taken by the framework proposed in our previous work [4], in which we evaluated three forecasting methods: ARIMA, XGBoost, and random forest.

Moreover, this paper provides an advanced online framework that implements additional methods on soil features at different depths, in order to obtain an accurate predictive model that anticipates soil parameter changes and supports better irrigation planning in agriculture, relying on supervised learning and especially on deep artificial neural networks.

The remainder of this paper is organized as follows. The next section gives a brief survey of recent smart irrigation systems and soil feature forecasting facilities. The third section presents the methodology followed. The fourth section describes the architecture of the proposed framework. The fifth section outlines a case study of soil moisture prediction and the forecasting models implemented in our framework, namely the auto-regressive integrated moving average, the random forest, the eXtreme Gradient Boosting, and the deep artificial neural networks, and discusses the performance of the tested forecasting methods. The last section summarizes the prediction results and compares the outcomes of these predictive models. Finally, we conclude by examining the limits, advantages, and potential perspectives of this work.

Related Works

Over the past few decades, numerous predictive studies have been carried out to improve the efficiency of water resource management, relying on advances in machine learning algorithms and smart systems. Some of this research focuses on predicting actual evapotranspiration from time series, for example by applying various machine learning methods to three types of evapotranspiration models with different input data. According to [5], these methods include M5P regression trees, bagging, random forests, and support vector regression, applied to data from an experimental site in Central Florida. Likewise, two types of streamflow modeling were compared in [6] using machine learning algorithms. The first is based only on climatic data (precipitation, temperatures, and potential evapotranspiration), while the second also integrates previous flows into the input data. Several predictive models were tested to predict river flows, such as multiple linear regression, the TUW hydrological model, eXtreme Gradient Boosting, a deep learning neural network, and random forest. The performance analysis was carried out using the root mean square error, the R2 statistic, the Kling–Gupta efficiency, the Nash–Sutcliffe efficiency, and the percent bias. Three aspects were examined to improve the precision of these flow simulation methods: the effect of the selected method on the accuracy of the results, the impact of feature engineering on accuracy, and the efficiency of the created models. Moreover, according to [7], a smart irrigation decision support system (SIDSS) was developed to manage crop irrigation based on a weekly estimation of water needs, using soil measurements and weather parameters collected by various autonomous nodes deployed in the field together with the ANFIS and PLSR machine learning techniques.
In the same context, a smart system based on open-source technology was proposed in [8]. It runs an algorithm based on K-means and SVR to forecast irrigation requirements for the near future, using both sensed ground parameters (soil moisture, soil temperature, etc.) and weather forecast features (humidity, precipitation, UV, and air temperature) available on the Internet. In this system, the input data are collected remotely in the cloud through web services, and insights are obtained in real time from the sensor network and the weather forecast through a decision support tool with web visualization. More recently, a model-based decision support system was introduced in [9]. It relies on a wireless sensor network to collect real-time soil and environmental data, on a neural network algorithm to predict hourly soil moisture content, and on soil evapotranspiration estimated with the Blaney–Criddle method, combined with fuzzy logic, to monitor and control irrigation in line with the weather and to send mobile notifications about irrigation needs to the farmer through an integrated GSM modem. As a result, water was saved and yield increased. Several machine learning models, such as support vector machines, decision trees, random forests, artificial neural networks, and Bayesian networks, are now used to support farmers in crop cultivation and intelligent farming. In addition, an overview of yield prediction based on agrarian factors and weather features, reported in [10], compared supervised and unsupervised machine learning algorithms using various error measures such as the root mean square error, the relative root mean square error, the mean absolute error, and the R2 coefficient of determination.

In a groundwater-dependent irrigated area in northwestern Bangladesh, a study in [11] evaluated the effects of climate change on the cost of irrigation under different RCP scenarios. It applied general circulation models (GCMs) to project the climate, an empirical hydrological model based on support vector machines to simulate groundwater levels from climatic variables, and a multiple linear regression to estimate the irrigation cost induced by fluctuations in groundwater levels. The results reveal that climate change caused a decline in the groundwater level, which increased the crop production cost less than other costs. To avoid over- or under-irrigation due to spatial variations in deep percolation, rainfall, runoff, irrigation, crop water use, and irrigation depth, and in particular to support decisions on sprinkler irrigation control, a site-specific integrated irrigation controller was developed. It allows real-time monitoring of irrigation tasks through Bluetooth communication, using an in-field wireless sensor network (WSN) and remote sensing of soil, canopy, and air temperature and of soil moisture in cultivated fields. This system converts an automated irrigation machine from a traditional mechanical and hydraulic system into an electronically controllable system for individual sprinkler control. It then monitors the geographic locations of the sprinklers with a self-positioning system and finally decides when to irrigate and how much water each sprinkler head should apply at a specific location.
The WISC software was tested for in-field wireless sensor-based closed-loop irrigation control during the 2007 growing season under a linear-move irrigation system on a field planted with malting barley at the Eastern Agricultural Research Center of Montana State University in Sidney. It succeeded in remotely monitoring field conditions in real time and in providing control feedback for site-specific irrigation, with a strong correlation (R2 = 0.98) with the water collected in catch cans [12].

With the rapid growth of massive data analysis technologies, a multitude of free Python tools and libraries have been made publicly available for machine learning. They enable efficient data preparation (NumPy, Pandas, etc.), deep and accurate data analysis and prediction in a reasonable time (PySpark, Keras, Scikit-learn, etc.), and easy plotting of results (Matplotlib, Seaborn, etc.). The study in [13] compared these libraries to select the best ones for each kind of data preparation, analysis, or prediction. The authors recommended Pandas for data preprocessing and manipulation, and Plotly, Seaborn, and Matplotlib for data customization and visualization. For deep learning, they suggested PyTorch or Keras for rapid prototyping and TensorFlow for active customization. They also recommended Hadoop Streaming and PySpark in the field of big data (Table 1).

Table 1 An overview of predictive models and relevant features by Asmae El Mezouari and Mehdi Najib [4]

Methodology

In our previous experiments [4], we applied three machine learning algorithms, namely extreme gradient boosting, random forest, and the auto-regressive integrated moving average, to train on the soil data set using various input features selected by the resampling method. In this paper, we also trained a deep ANN on the data set with the same resampled inputs. Then, we predicted the soil moisture at several depths for the test period. XGBoost is an implementation of gradient boosting machines (GBMs) that boosts the accuracy of predictive models by combining the predictive power of multiple learners in the form of gradient boosted trees. In this algorithm, the final predictive model is the aggregated prediction of several weak learners, and at each iteration a new learner is added to the previous ones to reduce their errors. According to [14], XGBoost is implemented in multiple programming languages, supports parallel computation, and exposes parameters that can be tuned as necessary. Another model used in this paper is the random forest, a combination of multiple tree predictors that provide independent predictions from the same input data distribution; at the end of the computation, the most-voted prediction (or, for regression, the average of the tree outputs) is selected as the final output. A random forest combines a large set of weak learners to build a robust predictor [15]. ARIMA is a statistical model for prediction based on linear time series analysis, in which the predicted value is a linear function of past observations and random errors [16]. According to [17], ARIMA handles non-stationary time series by differencing of order (d). ARIMA combines two models: the pure auto-regressive model, represented by the lag order (p), and the moving average model, expressed by the order of the moving average (q).
The last predictive model is the deep artificial neural network, which processes data through multiple layers in a way loosely inspired by information processing in the human brain [18]. Deep learning is distinguished from basic artificial neural networks in that its learning nodes are more autonomous and can learn from and process the data themselves to improve the learning of the network.
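The boosting idea described above for XGBoost, in which each new weak learner is fitted to the errors left by the previous ones, can be illustrated with a minimal sketch. It is not the XGBoost library and not the code used in this work: it assumes squared-error loss, one-split regression trees (stumps) as weak learners, a single input feature, and a small synthetic series; all function names and values are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression tree to the residuals by least squares."""
    best_stump, best_error = (x[0], 0.0, 0.0), float("inf")
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right) if right else left_value
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if error < best_error:
            best_stump, best_error = (threshold, left_value, right_value), error
    return best_stump


def predict_stump(stump: Stump, xi: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_boosting(x: List[float], y: List[float], n_rounds: int = 50,
                 learning_rate: float = 0.3) -> Tuple[float, float, List[Stump]]:
    """Return (base prediction, learning rate, stumps) for squared-error boosting."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps


def predict_boosting(model: Tuple[float, float, List[Stump]], xi: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mean_squared_error(y: List[float], predictions: List[float]) -> float:
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Synthetic, hypothetical soil-moisture-like values indexed by time step.
x = [float(i) for i in range(20)]
y = [0.30 + 0.01 * (i % 5) + (0.05 if i >= 10 else 0.0) for i in range(20)]

model = fit_boosting(x, y)
baseline_mse = mean_squared_error(y, [model[0]] * len(y))
boosted_mse = mean_squared_error(y, [predict_boosting(model, xi) for xi in x])
```

Comparing `boosted_mse` with `baseline_mse` shows how the aggregated weak learners reduce the training error relative to predicting the mean alone. XGBoost adds regularization, second-order gradient information, and parallel tree construction on top of this basic scheme.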

Framework Structuring

The proposed framework is designed for soil diagnosis and prediction, in order to improve irrigation scheduling and ease decision-making in agriculture. It aims not only to upload and integrate data sets and to provide predictions, but also to interpret and export the best results through a web interface. Figure 1 illustrates the composition of the proposed framework.

Fig. 1 An overview of the proposed framework

This framework can be described as follows. The first module, data integration, imports the soil and environment data collected by environment and soil sensing and stored in a MongoDB database into the storage unit of the Hadoop ecosystem (HDFS). The second module, data processing, runs parallel predictions with different machine learning algorithms, in particular ARIMA, XGBoost, random forest, and deep artificial neural networks, selects the best predictive model after analyzing their performances, and exports the results through a web interface. The resulting predictions of soil and environment features are then used for water need estimation and irrigation planning. The framework also allows other predictive models, features, and performance measures to be added so that optimal predictions can be obtained quickly.
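The selection logic of the data processing module can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: two trivial forecasters stand in for ARIMA, XGBoost, random forest, and the deep ANN, a thread pool stands in for distributed execution on Spark, and all names (`persistence`, `historical_mean`, `select_best_model`) are hypothetical. Models and performance measures are passed in as plain functions, which mirrors the extensibility described above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Forecaster = Callable[[List[float], int], List[float]]
Metric = Callable[[List[float], List[float]], float]


def persistence(train: List[float], horizon: int) -> List[float]:
    """Hypothetical stand-in model: repeat the last observed value."""
    return [train[-1]] * horizon


def historical_mean(train: List[float], horizon: int) -> List[float]:
    """Hypothetical stand-in model: repeat the mean of the training data."""
    mean = sum(train) / len(train)
    return [mean] * horizon


def mean_absolute_error(observed: List[float], predicted: List[float]) -> float:
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def select_best_model(models: Dict[str, Forecaster], train: List[float],
                      test: List[float],
                      metric: Metric = mean_absolute_error) -> Tuple[str, Dict[str, float]]:
    """Evaluate every registered model concurrently and return the lowest-error one."""
    def evaluate(name: str) -> Tuple[str, float]:
        forecast = models[name](train, len(test))
        return name, metric(test, forecast)

    with ThreadPoolExecutor() as pool:
        scores = dict(pool.map(evaluate, models))
    best_name = min(scores, key=scores.get)
    return best_name, scores


# Registering a new model only requires adding an entry to this dictionary.
models = {"persistence": persistence, "historical_mean": historical_mean}
best_name, scores = select_best_model(models, [0.1, 0.2, 0.3, 0.4], [0.4, 0.4])
```

Keeping the models in a dictionary keyed by name means that adding a predictor or swapping the metric does not require touching the selection code.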

Materials and Methods

Data Processing Tools

Dealing with big data storage and analysis in the agricultural field is challenging, especially when it comes to adapting to climate change, which requires overcoming the difficulties of retrieving knowledge from historical climatic and hydrologic data. For this reason, we have chosen to integrate data into Hadoop using the MongoDB Connector, to use HDFS to read data from and write data to disk, and to use Apache Spark to speed up machine learning processing via in-memory computation (RAM), as shown in Fig. 2.

Fig. 2 An overview of the proposed techniques [4]

In this paper, we focus on testing different machine learning techniques, namely ARIMA, XGBoost, random forest, and deep ANN, to make predictions in parallel and on a distributed scale, and on analyzing their performance to select and save the best predictive model.

For machine learning processing, Spark provides a powerful open-source library called MLlib that makes it scalable and easy to use [19]. As with any basic predictive model implementation, and after integrating the data into HDFS using the MongoDB Connector, we performed our predictions in Spark on Hadoop using Python (PySpark) by following the steps below, using various libraries (Sparktk, XGBoost4J-Spark, Tree, Keras, …) for each algorithm as needed:

  • Step 1: Reading the data file from HDFS in Spark, constructing the data frame, and using the time method to record the time.

  • Step 2: Splitting the dataset into train data and test data.

  • Step 3: Converting the data into vectors using Vector Assembler.

  • Step 4: Transforming the vectors into necessary data frames.

  • Step 5: Building the model and fitting it on the training data.

  • Step 6: Making predictions on the test data.

  • Step 7: Calculating performance measures (MAE, MSE, RMSE, R2-accuracy) to evaluate the accuracy of the model.

  • Step 8: Exporting and visualizing the results.
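The steps above can be mirrored in a small, dependency-free sketch. It does not use PySpark or any of the libraries listed, and it is not the code run in this work: a list of numbers stands in for the data frame, lagged pairs stand in for the assembled feature vectors, and a one-feature least-squares line stands in for the predictive model. The function names and the sample values are hypothetical.

```python
import time
from typing import Dict, List, Tuple


def to_lagged_pairs(series: List[float]) -> List[Tuple[float, float]]:
    """Steps 3-4 analogue: pair each value (target) with the previous one (feature)."""
    return list(zip(series[:-1], series[1:]))


def fit_line(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of target = intercept + slope * feature."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def run_pipeline(series: List[float], train_ratio: float = 0.75) -> Dict[str, object]:
    start = time.perf_counter()                        # Step 1: start timing
    cut = int(len(series) * train_ratio)               # Step 2: chronological split
    train, test = series[:cut], series[cut:]
    intercept, slope = fit_line(to_lagged_pairs(train))  # Steps 3-5: fit on the training part
    previous = [train[-1]] + test[:-1]                 # Step 6: one-step-ahead predictions
    predictions = [intercept + slope * p for p in previous]
    mae = sum(abs(o - p) for o, p in zip(test, predictions)) / len(test)  # Step 7
    return {                                           # Step 8: results ready for export
        "train_size": len(train),
        "test_size": len(test),
        "predictions": predictions,
        "mae": mae,
        "elapsed_seconds": time.perf_counter() - start,
    }


# Hypothetical hourly soil moisture readings (volumetric fraction).
readings = [0.30, 0.29, 0.28, 0.31, 0.35, 0.33, 0.32, 0.30, 0.29, 0.28, 0.27, 0.26]
result = run_pipeline(readings)
```

In a Spark setting, the same sequence would typically rely on a data frame reader, a feature assembler, and a distributed estimator, but the order of operations is the same.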

Performance Measures

We measured the performance of these predictive models using the mean absolute error, the r-squared accuracy, and the root mean square error.

RMSE

The RMSE measures the residuals between predicted and observed values. In general, a lower value of this measure indicates higher precision. In our case, all predictive models have the same output feature (irrigation amount). Consequently, the root mean square error is sufficient to assess their performance. It is calculated using the formula:

$${\text{RMSE}} = \sqrt{\sum\limits_{i = 1}^{n} \frac{\left( y_{i} - \overline{y_{i}} \right)^{2}}{n}},$$
(1)

where n is the number of data points, \(y_{i}\) is the actual (observed) output of instance i, and \(\overline{y_{i}}\) is the corresponding predicted value.

MAE

The mean absolute error measures the absolute deviation between the true and the predicted values. This means that the sign of each error is ignored. MAE is calculated as

$${\text{MAE}} = \sum\limits_{i = 1}^{n} \frac{\left| y_{i} - \overline{y_{i}} \right|}{n},$$
(2)

where \(y_{i}\) is the observed value of instance i and \(\overline{y_{i}}\) is the corresponding predicted value.

R-Squared Accuracy

R-squared is the fraction by which the variance of the errors is less than the variance of the dependent variable. It corresponds to the square of the correlation between the observed and estimated variables,

$$R^{{2}} \;{\text{accuracy}} = {\text{Explained}}\;{\text{variation}}/{\text{total}}\;{\text{variation,}}$$
(3)

The R-squared accuracy is a percentage between 0 and 100:

  • 0 indicates that the model explains none of the variability of the predicted variable around its mean.

  • 100 indicates that the predictive model explains all of the variability of the predicted variable around its mean.
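The three measures of this section can be written compactly as below. This is a minimal sketch, not the evaluation code of this work. It assumes that the R-squared accuracy is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, expressed as a percentage, which matches "explained variation / total variation" for least-squares fits; with this form the value can fall below 0 when a model is worse than predicting the mean.

```python
from math import sqrt
from typing import List


def rmse(observed: List[float], predicted: List[float]) -> float:
    """Root mean square error: square root of the mean squared residual."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed: List[float], predicted: List[float]) -> float:
    """Mean absolute error: mean of the residuals with their sign ignored."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r2_percent(observed: List[float], predicted: List[float]) -> float:
    """R-squared as a percentage, assuming R2 = 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 100.0 * (1.0 - ss_res / ss_tot)


observed = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly one unit
```

For the `shifted` predictions, every residual equals one unit, so the RMSE and the MAE coincide; they diverge as soon as the residuals differ in size, because the RMSE weights large errors more heavily.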

Case Study

In the present case study, we tested and compared the accuracy of the ARIMA, random forest, XGBoost, and deep ANN methods, which form the cornerstone of irrigation scheduling. For this purpose, we used a real 5-year time series of hourly soil moisture and temperature data, sensed at five depths (5, 20, 35, 50, and 75 cm) in the rain-snow transition zone of the Johnston Draw catchment, Reynolds Creek Experimental Watershed and Critical Zone Observatory, USA [20]. This data set contains over 35,064 records from 10/1/2010 to 09/30/2014. We tested these methods on the data of the 5 cm depth. Figure 3 illustrates the trends of the studied time series.

Fig. 3 Time series of hourly soil moisture and temperature at 5 depths (USA) [4]

Results and Discussion

ARIMA Model Selection

ARIMA model selection consists of specifying the three parameters p, d, and q. In the first step, we determined the parameter “d”, which represents the differencing order needed to make the time series stationary. This parameter is determined by analyzing the autocorrelation diagram of the time series. Figure 4 shows a regular decrease in the autocorrelation values, which indicates that the initial time series is not stationary. In our case, first-order differencing was adequate to make it stationary, and the near-zero autocorrelation values in Fig. 5 support this hypothesis.
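The reasoning above, that a slowly decaying autocorrelation signals non-stationarity and that differencing removes it, can be reproduced on a toy example. The sketch below is illustrative only: it uses a small deterministic synthetic series with an upward drift rather than the soil moisture data, and it assumes the usual approximate 95% confidence bound of 1.96/√n for the sample autocorrelation. The function names are hypothetical.

```python
from math import sqrt
from typing import List


def difference(series: List[float], order: int = 1) -> List[float]:
    """Apply differencing `order` times: y'_t = y_t - y_(t-1)."""
    result = list(series)
    for _ in range(order):
        result = [b - a for a, b in zip(result[:-1], result[1:])]
    return result


def autocorrelation(series: List[float], lag: int) -> float:
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((series[t] - mean) * (series[t - lag] - mean)
                    for t in range(lag, n))
    return numerator / denominator


def confidence_bound(n: int) -> float:
    """Approximate 95% bound for the autocorrelation of white noise."""
    return 1.96 / sqrt(n)


# Synthetic drifting series: steps of 1.5, 1.5, 0.5, 0.5 repeated ten times.
pattern = [1.0, 1.0, -1.0, -1.0] * 10
steps = [1.0 + 0.5 * p for p in pattern]
series = [0.0]
for step in steps:
    series.append(series[-1] + step)

raw_ac1 = autocorrelation(series, 1)    # large: the raw series trends upward
differenced = difference(series)
diff_ac1 = autocorrelation(differenced, 1)  # small: the drift has been removed
```

Here the lag-1 autocorrelation of the raw series lies far above the confidence bound, while that of the differenced series falls inside it, which is the pattern used in the text to settle on d = 1.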

Fig. 4 AC and PAC of non-stationary soil moisture at 5 cm depth [4]

Fig. 5 AC and PAC of first-order differenced soil moisture at 5 cm depth [4]

The autocorrelation (AC) and partial autocorrelation (PAC) diagrams are used to determine the "q" parameter of the MA model and the "p" parameter of the AR model. Referring to Fig. 5, no order of the AC diagram exceeds the confidence level. Thus, parameter q is equal to 0. Based on the analysis of the PAC in Fig. 4, we can observe that the first two partial autocorrelation values exceed the confidence level. As a result, the "p" parameter can be equal to 0, 1, or 2. To select the best configuration, we tested the accuracy of combinations of the three parameters, such as ARIMA (1, 0, 0), ARIMA (1, 1, 1), and ARIMA (2, 0, 2), using out-of-time cross-validation.

In out-of-time cross-validation, we go back a few steps in time and forecast as many steps into the future as we went back. We then compare the forecast with the observed data. To do so, the time series is divided into two contiguous parts, a training set and a test set, at a ratio of approximately 75:25 or another reasonable proportion based on the time frequency of the series. The obtained results show that the last model outperforms the other models in terms of accuracy. Consequently, ARIMA (2, 0, 2) is chosen for soil moisture prediction. Figure 6 portrays the results of the soil moisture prediction for the last (test) year using this model.
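A minimal sketch of out-of-time validation is given here. It is not the ARIMA evaluation performed in this work: a simple drift forecaster, which extrapolates the average step of the training data, stands in for a fitted ARIMA model, and the function names and the toy series are hypothetical. The point is only the procedure: hold out the most recent part of the series, forecast exactly that many steps from the earlier part, and score the forecast against the held-out values.

```python
from math import sqrt
from typing import Callable, List, Tuple

Forecaster = Callable[[List[float], int], List[float]]


def out_of_time_split(series: List[float],
                      train_ratio: float = 0.75) -> Tuple[List[float], List[float]]:
    """Split into two contiguous parts; the test part is the most recent one."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]


def drift_forecast(train: List[float], horizon: int) -> List[float]:
    """Hypothetical stand-in for a fitted model: extrapolate the average step."""
    step = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + step * (h + 1) for h in range(horizon)]


def out_of_time_rmse(series: List[float], forecaster: Forecaster,
                     train_ratio: float = 0.75) -> float:
    """Forecast as many steps as were held out and return the RMSE on them."""
    train, test = out_of_time_split(series, train_ratio)
    forecast = forecaster(train, len(test))
    return sqrt(sum((o - f) ** 2 for o, f in zip(test, forecast)) / len(test))


# A perfectly linear toy series: the drift forecaster recovers it exactly.
toy_series = [float(i) for i in range(8)]
toy_error = out_of_time_rmse(toy_series, drift_forecast)
```

Unlike a random split, this keeps the temporal order intact, so the model is never evaluated on observations that precede its training data.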

Fig. 6 Soil moisture prediction at 5 cm depth using exclusively the soil moisture parameter [4]

XGBoost Model Selection

In this step, we trained the XGBoost regression model using four years of soil moisture and temperature data at the 5 cm depth, and then using only the soil moisture parameter at the same depth. After that, we performed predictions for the last year. Figure 7 illustrates the prediction results using the soil moisture and temperature parameters, and Fig. 8 shows the prediction results using only the soil moisture parameter.

Fig. 7 XGBoost soil moisture prediction at 5 cm depth using soil moisture and temperature parameters [4]

Fig. 8 XGBoost soil moisture prediction at 5 cm depth using only the soil moisture parameter [4]

Random Forest Model Selection

In this step, we trained the random forest regression model using four years of soil moisture and temperature data at the 5 cm depth, and then using only the soil moisture parameter at the same depth. After that, we performed predictions for the last year. Figure 9 shows the prediction results using the soil moisture and temperature parameters, and Fig. 10 shows the prediction results using only the soil moisture parameter.

Fig. 9 Random forest soil moisture prediction at 5 cm depth using soil moisture and temperature parameters [4]

Fig. 10 Random forest soil moisture prediction at 5 cm depth using only the soil moisture parameter [4]

Deep Artificial Neural Network Model Selection

In this step, we trained the deep artificial neural network model using the first four years of soil moisture and temperature data at the 5 cm depth, and then using only the soil moisture parameter at the same depth. After that, we made predictions for the last year.

We created our deep ANN (multi-layer perceptron) using the Keras sequential model with the rmsprop optimizer, a very popular optimization algorithm. The input layer takes the ten relevant features for the first sampling input (soil moisture, soil temperature, hour, day of the week, quarter, month, year, day of the year, day of the month, and week of the year) and the nine relevant features for the second sampling input (the same features without soil temperature), and it feeds a first layer of 100 hidden units with the relu activation function in all experiments. We then added a hidden layer of 60 units with the relu activation function and an output layer that predicts the target feature (soil moisture). We used the MSE as the loss function and the MAE and the accuracy as evaluation metrics. Figure 11 shows the ANN model used for prediction, which follows a typical configuration for regression. After trying many configurations, such as different numbers of hidden units and various activation functions, we found that this model is the most appropriate option in terms of accuracy.
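The eight calendar features listed above can all be derived from the timestamp of each hourly record. The sketch below shows one possible derivation with the Python standard library; it is an assumption about how such features may be built, not the preprocessing code of this work. In particular, the numbering conventions (Monday as day 0, ISO week numbers) and the function name `calendar_features` are choices made for the illustration.

```python
from datetime import datetime
from typing import Dict


def calendar_features(timestamp: datetime) -> Dict[str, int]:
    """Derive the calendar inputs of the network from one record timestamp."""
    return {
        "hour": timestamp.hour,
        "day_of_week": timestamp.weekday(),          # assumption: Monday = 0
        "quarter": (timestamp.month - 1) // 3 + 1,
        "month": timestamp.month,
        "year": timestamp.year,
        "day_of_year": timestamp.timetuple().tm_yday,
        "day_of_month": timestamp.day,
        "week_of_year": timestamp.isocalendar()[1],  # assumption: ISO week number
    }


# First day of the studied period, at 13:00.
features = calendar_features(datetime(2010, 10, 1, 13))
```

Adding the soil moisture reading gives the nine inputs of the second sampling, and adding the soil temperature as well gives the ten inputs of the first.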

Fig. 11 The ANN model’s configuration
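To make the layer sizes concrete, the following dependency-free sketch builds a network with the shape described in this section and runs one forward pass. It is not the Keras model used in this work: it reads the description as 10 inputs, a 100-unit relu layer, a 60-unit relu layer, and one output unit, assumes a linear output activation, and uses small random weights in place of trained ones. All names and input values are hypothetical.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)

LAYER_SIZES = [10, 100, 60, 1]  # inputs, first relu layer, second relu layer, output


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create one dense layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(sizes: List[int], seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(layers: List[Layer], features: List[float]) -> float:
    """Propagate one sample: relu on hidden layers, linear output (assumption)."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        pre_activation = [sum(w * a for w, a in zip(row, activations)) + b
                          for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = pre_activation if is_output else [max(0.0, v) for v in pre_activation]
    return activations[0]


def count_parameters(sizes: List[int]) -> int:
    """Weights plus biases of every dense layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


network = build_network(LAYER_SIZES)
sample = [0.3, 12.5, 13.0, 4.0, 4.0, 10.0, 2010.0, 274.0, 1.0, 39.0]  # hypothetical inputs
prediction = forward(network, sample)
total_parameters = count_parameters(LAYER_SIZES)
```

Under this reading, the ten-feature network has 7,221 trainable parameters and the nine-feature variant has 7,121, since only the first layer depends on the number of inputs. In practice the inputs would usually be scaled before training, because features such as the year are several orders of magnitude larger than the soil moisture values.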

Figure 12 shows the prediction results using the soil moisture and temperature parameters, and Fig. 13 shows the prediction results using only the soil moisture parameter. The curve representing the forecast of soil moisture based solely on soil moisture follows the real values quite closely.

Fig. 12 Deep ANN soil moisture prediction at 5 cm depth using soil moisture and temperature parameters

Fig. 13 Deep ANN soil moisture prediction at 5 cm depth using only the soil moisture parameter

Synthesis

The tests carried out with ARIMA (2, 0, 2), XGBoost, random forest, and deep ANN show that all four methods are accurate for forecasting soil moisture. The root mean square error, the mean absolute error, and the R-squared accuracy confirm the efficiency of these models. Table 2 shows the evaluation matrix.

Table 2 Evaluation matrix of the evaluated models

Finally, we find that the deep ANN outperforms the other models in terms of accuracy when predicting using only the soil moisture parameter. This may be related to the homogeneity of the data, which reinforces the learning of the deep ANN model. In addition, an appropriate configuration of the model, such as increasing the number of epochs, minimizes the errors. Likewise, the activation function injects nonlinear properties into the network so that it can learn complex relationships between input and output; this is one of the key concepts behind universal approximation in ANN models and improves the efficiency of the model. Besides, we observed that XGBoost and random forest give the same accuracy whether or not the soil temperature parameter is used.

Conclusion and Perspectives

In the present paper, we improved our proposed framework, which allows us to run various predictive algorithms over soil variables at different depths, by adding the artificial neural network model to the processing phase. The main goal of this experimentation is to select the most accurate predictive model, one that anticipates changes in the soil state and helps farmers align irrigation scheduling with climatic changes and, ultimately, improve agricultural yield, by comparing the efficiency of various supervised learning methods. Hence, to select the best predictive model for improving irrigation planning, we compared four forecasting models, namely ARIMA, XGBoost, random forest, and deep ANN, in terms of several precision measures: the MAE, the RMSE, and the R-squared accuracy.

Moreover, to validate this solution, we tested these forecasting methods on a real time series of soil moisture and temperature from the USA and examined their efficiency using different performance measures. Based on the results, we found that the ARIMA, XGBoost, random forest, and deep ANN models all provided accurate predictions. However, the deep ANN outperforms all other models in terms of precision in all cases.

As a perspective, we expect that employing such efficient and powerful processing and predictive tools to forecast the soil state could support irrigation planning in the short and long term. Testing other predictive models in future work could also improve the results, particularly when integrating different parameters and validation processes.