1 Data Analytics in Agriculture

By 2050, global demand for food is expected to increase substantially, requiring food production to rise by 25% to 70% (Hunter et al. 2017); for this reason, it is crucial to double food production per hectare by the time the world population stabilizes around 2100 (United Nations 2019). Food security is a fundamental global need, threatened by population growth, climate change, decreasing arable land, food waste, and living standards that drive consumer preference for animal protein (White et al. 2021).

Increasing agricultural or food production rapidly is difficult (Ahmad and Huang 2021). To achieve it, the agricultural sector needs to employ cutting-edge technologies such as cloud computing, the Internet of Things (IoT), Big Data, and machine learning (ML) (Ahmad and Huang 2021; Gopal Maya 2020). Through these technologies, data-driven agriculture is the most promising approach to solving these current and future problems (Ahmad and Huang 2021), as it improves crop yields, reduces costs, and ensures sustainability (Torky and Hassanein 2020).

Digital agriculture, like agrotechnology and precision agriculture, is a new scientific discipline that promotes agricultural productivity while minimizing environmental impact through data analysis (Liakos et al. 2018). Data are extracted from farm operations using various sensors, satellite imagery, videos, and photographs. This is possible as data analysis enables more accurate decisions through better knowledge of crop dynamics, weather conditions, soil, and farm machinery use (Liakos et al. 2018).

As the number of smart sensors and machines on farms increases and a wider variety of data is used, farms will become increasingly data-driven, enabling the development of smart farming (Sundmaeker et al. 2017). The difference between precision farming and smart farming is that the former was developed for farm management, and the latter considers real-time situations triggered by an event (Wolfert et al. 2017). In addition, smart farming includes intelligent assistance in implementing, maintaining, and using information technology (IT), enabling farmers to react quickly to sudden changes, such as disease alerts or weather events (Nandyala and Kim 2016).

Li et al. (2020) explain that Agricultural Big Data is a comprehensive, cutting-edge technology, as it contains specific concepts, technologies, and measures covering the whole range of agricultural activities, such as farming and planting. This technology allows the processing of a large amount of heterogeneous data, such as light, temperature, and humidity during crop growth, as well as data on all aspects of the production process (Li et al. 2020). With the characteristics of informatization, intelligence, and precision, it can solve the problems encountered in traditional agriculture and provide new support for agricultural development. Agricultural Big Data can also promote the structural reform of the agricultural supply side (Li et al. 2020). However, research on Agricultural Big Data is still at an initial stage, so further research and analysis are needed (Li et al. 2020).

According to Gopal Maya (2020), the multimodal nature of agricultural data poses several challenges, such as improving methods for data collection and selecting effective statistical and data analysis techniques to understand and support agricultural activities. To address these challenges, smart agriculture relies on ML, the scientific field that allows machines to learn without being explicitly programmed. ML has emerged along with Big Data technologies and high-performance computing to create new opportunities to facilitate, quantify, and understand data-intensive processes in agricultural operating environments (Gopal Maya 2020).

Experts indicate that agriculture can benefit from ML at all stages, such as species management, field management, crop management, and livestock management (Gopal Maya 2020). ML is used in several agricultural applications, including yield prediction algorithms based on weather and historical yield data, image recognition algorithms to detect pests and diseases in plants, and robotics to harvest different types of specialty crops (Tibbetts 2018).

Agricultural Big Data is playing an important role by incorporating ML. Farmers are using data to calculate crop yield, fertilizer demand, cost savings, and even to identify optimization strategies for future crops (Gopal Maya 2020). For the case of crops, ML is being used for yield prediction, disease detection, weed detection, crop quality, and species recognition. In the case of livestock, it is being used for animal welfare and livestock production (Liakos et al. 2018). In this chapter, we explain some practical examples of their use.

2 Data and Storage in Agriculture

2.1 Data

In Agricultural Big Data and ML, structured, semistructured, and unstructured data are often used, which adds complexity to the analysis process, as their use poses a significant challenge (Saiz-Rubio and Rovira-Más 2020). Unstructured data come from archives, such as videos, satellite images, and surveys, which contain a large amount of information hidden from the data scientist and cannot be analyzed directly. On the other hand, semistructured data have been stored in spreadsheets and repositories containing both essential and unimportant data for the desired analysis. Therefore, it is also necessary to process these data to obtain essential structured data, allowing data scientists to perform the relevant analyses (Cravero et al. 2022a).

Unfortunately, processing unstructured data is not trivial, as it requires the use of specialized tools and the knowledge of subject matter experts. It also requires selecting the right types of repositories and databases for further processing and analysis (Šuman et al. 2020). Therefore, it is essential to identify available data, necessary processing, and potential studies based on the generated data, as ML requires test datasets of sufficient quality to achieve the expected learning (Bhatnagar 2018).

According to Nandi and Sharma (2020), the analytics that can be performed using ML can be descriptive, diagnostic, predictive, and prescriptive. Prescriptive analytics is the most complex, as it is responsible for finding a solution among several variants to optimize resources and increase operational efficiency: the more complex the studies to be performed, the more complex the data processing will be.

According to Firdaus and Hassan (2020), it is essential to know the data type before applying any algorithm. Therefore, data type plays a vital role in preprocessing and visualization. There are four main types of data: numeric, categorical, time series, and text. Numerical data are further classified into continuous and discrete. Categorical data types represent quality; concepts such as “good”, “bad,” and others define levels. These data must be processed to be described as numbers rather than text.
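As a minimal illustration of this last step, the following Python sketch converts categorical labels into numbers. The level names and their order are hypothetical and are not taken from the cited work. Ordinal encoding assumes the levels have a natural order; one-hot encoding is a common alternative when they do not.

```python
# Hypothetical quality levels, ordered from worst to best (illustrative only).
QUALITY_LEVELS = ["bad", "regular", "good"]


def ordinal_encode(values, levels):
    """Map each categorical label to its position in the ordered `levels` list."""
    index = {label: position for position, label in enumerate(levels)}
    return [index[value] for value in values]


def one_hot_encode(values, levels):
    """Map each label to a 0/1 vector; useful when levels have no natural order."""
    return [[1 if value == label else 0 for label in levels] for value in values]


soil_quality = ["good", "bad", "good", "regular"]
print(ordinal_encode(soil_quality, QUALITY_LEVELS))     # [2, 0, 2, 1]
print(one_hot_encode(["bad", "good"], QUALITY_LEVELS))  # [[1, 0, 0], [0, 0, 1]]
```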

2.2 Agricultural Data

Data in Agriculture refer to variables or attributes that farmers need to carry out their business activities. The data can be specific agricultural records or parameters, such as crop varieties, yields, soil types in use, and acreage, and business-related information, such as products, suppliers, customers, and payments. They are classified into structured, semistructured, and unstructured data, depending on the storage format in which they are stored (Cravero et al. 2022a).

The data used in Agricultural Big Data come from sensors, IoT devices, satellites, cameras, global positioning systems (GPS), databases, and farmers’ expert knowledge. Figure 1 shows the main data used as a concept cloud. It can be seen that the most used data are temperature, humidity, crop area, wind direction, and wind speed.

Fig. 1
A word cloud where the words with the most occurrence are temperature, cultivation area, humidity, wind direction, and wind speed.

Cloud of data concepts used in Agricultural Big Data

An example of the use of data is the work of Nóbrega et al. (2018), in which they used video data to learn about sheep behavior in a vineyard. To do so, they analyzed each video to obtain a set of rules processed by experts. Yang et al. (2017) used video data to monitor plant growth status. Rehman et al. (2020) used various types of data for crop analysis, allowing them to understand soil temperature, atmospheric temperature, and humidity, as well as data from sensors.

Another example is the work of Dutta et al. (2015), who used heterogeneous data from different sources to generate a knowledge system of agricultural processes in conjunction with environmental processes. Data are obtained from a sensor network, large-scale simulated models, satellite imagery, meteorological data, and industry knowledge and experience to improve decision-making. The authors developed a Big Data system that incorporates unstructured, undocumented, and ad hoc knowledge into a structured rule base that allows for an improved decision support system.

On the other hand, Amani et al. (2020) also used satellite images to obtain information on terrain characteristics. The data are extracted directly from Google Earth Engine (GEE), as it improves the efficiency of data processing from the point of view of time and costs. In addition, GEE contains freely available remote sensing datasets and several classification algorithms, which can be accessed for various farmland applications.

Figure 2 shows the number of uses of different data sources for the generation or collection of Agricultural Big Data. The identified sources were categorized into six groups: sensors, cameras, databases, GPS, satellites, and people. Each of the groups is described below.

Fig. 2
A horizontal bar chart plots the data sources in agricultural big data. The values for database, sensors, satellite, camera, GPS, and person are 22, 17, 10, 6, 2, and 2, respectively.

Data sources used in Agricultural Big Data

Satellites are an essential data source for obtaining data on sizeable agricultural land. An array of sensors attached to the satellite is used to capture the data, from which numerous products can be obtained, such as optical, synthetic aperture radar (SAR), or thermal images. Cravero et al. (2022a) identified the use of six different satellites: Google Earth (Amani et al. 2020), Sentinel-1 (Shelestov et al. 2020) and Sentinel-2 (Sitokonstantinou et al. 2020), Landsat 7 and Landsat 8 (Dutta et al. 2015), and MODIS (Dutta et al. 2015).

The sensor group includes all IoT devices used statically in different locations to capture data. Many sensor types measure a single variable, such as temperature, radiation, or precipitation (Gnanasankaran and Ramaraj 2020). Devices that combine several sensors, such as weather stations or collars worn by animals, are also used (Nóbrega et al. 2018). Using this data source usually requires dealing with the deployment, connection, and maintenance of IoT devices. Some advantages of sensors are that the data obtained are specific to the area or task in which they are used and that they are captured in real time.

On the other hand, the temporal resolution of sensors tends to be high, with sampling intervals ranging from seconds to minutes, so large amounts of information are usually generated in a given measurement period (Yang et al. 2018). Wang and Mu (2022) explain that microsensors capable of capturing data on crop growth, land use, water use and characterization, and climatic variables, among other essential aspects, are being developed. The authors conclude that using these microsensors will enhance the development of artificial intelligence (Wang and Mu 2022).
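To give a sense of the volumes involved, the following sketch counts the readings produced by a group of sensors for different sampling intervals. The deployment figures (50 sensors over a 120-day season) are hypothetical and serve only to illustrate the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def readings_generated(interval_seconds, days, sensors=1):
    """Readings produced by `sensors` devices that sample every `interval_seconds`."""
    return sensors * days * SECONDS_PER_DAY // interval_seconds


# Hypothetical deployment: 50 sensors over a 120-day growing season.
for interval in (1, 60, 600):
    total = readings_generated(interval, days=120, sensors=50)
    print(f"every {interval:>3} s -> {total:,} readings")
```

Under these assumptions, sampling every second yields 518,400,000 readings per season, against 8,640,000 at one-minute intervals, which is why the choice of sampling interval directly drives storage requirements.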

Unlike sensors, databases allow easy and immediate access to a large amount of historical data, with accumulated records of up to 10 years. The vast majority of the identified databases are managed by public entities or government agencies, such as AWAP, CosmOZ, SILO, ASRIS, BOM, ISTAT, CNIR, IndiaStat, AAFC, ARPAS, ACIS, IMD, OGD, and KME (Cravero et al. 2022a).

2.3 Massive Storage

Two primary technologies have been employed in Big Data for massive storage: Apache Hadoop and cloud computing. In addition, relational database management systems (RDBMS) prepared to process in-memory data and NoSQL databases that store unstructured data have also been used.

Apache Hadoop is an open-source data processing ecosystem used for distributed computing, which has been created to address Big Data problems. In addition, Hadoop has been expanded to use geospatial data. Hadoop generally contains a Hadoop Distributed File System (HDFS) and a MapReduce programming environment for data processing (Alkathiri et al. 2019).
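The MapReduce programming model can be illustrated without a cluster. The sketch below is a single-process imitation in plain Python, not Hadoop code: the map step emits key-value pairs, a shuffle step groups them by key, and the reduce step aggregates each group. The sensor identifiers and temperatures are hypothetical.

```python
from collections import defaultdict

# Hypothetical (sensor_id, temperature in degrees Celsius) records.
READINGS = [("s1", 20.0), ("s2", 18.0), ("s1", 22.0), ("s2", 19.0), ("s1", 24.0)]


def map_phase(record):
    """Emit (key, value) pairs; here the value is a (sum, count) partial result."""
    sensor_id, temperature = record
    yield sensor_id, (temperature, 1)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the partial (sum, count) results of one key into a mean."""
    total = sum(partial_sum for partial_sum, _ in values)
    count = sum(partial_count for _, partial_count in values)
    return key, total / count


def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(run_job(READINGS))  # {'s1': 22.0, 's2': 18.5}
```

In a real Hadoop deployment, the map and reduce functions run in parallel on the nodes that hold the HDFS blocks, and the framework handles the shuffle across the network.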

Cloud computing provides various services over the Internet that are scalable. This technology allows resource sharing using the infrastructure owned by a cloud service provider. The provider’s users or customers can access resources on demand by paying per use. It enables the abstraction of infrastructures, such as storage, network, and applications, through its three services: Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and Software as a Service (SaaS) (Odun-Ayo et al. 2018). The fourth layer of services is Business Intelligence (BI), which contains applications to measure management indicators.

2.4 Massive Storage in Agriculture

Cravero et al. (2022a) analyzed 36 papers where Big Data and ML are applied for analysis in agriculture. Figure 3 shows the distribution of uses of the mentioned platforms, categorized into Hadoop, relational database, NoSQL database, and cloud.

Fig. 3
A horizontal bar chart plots the mass storage used in agricultural big data. The values for Hadoop, Relational DB, Cloud, and NoSQL DB are 11, 6, 5, and 4, respectively.

Mass storage used in agriculture

The cloud category includes several cloud computing services, such as AWS (Amazon Web Services) or GEE (Google Earth Engine). A direct advantage of using these platforms is their large computational and storage capacity. They are suitable for working with Big Data and can be resized according to the user’s needs. Another benefit is the free and direct access they provide to different data sources, such as satellite data captured by Landsat or Sentinel satellites.

Shelestov et al. (2020) used AWS’s fast and easy access to Sentinel-1 and Sentinel-2 satellite imagery to work with datasets using up to 3 TB of memory space, eliminating the problems associated with downloading and storing data related to Big Data. Gumma et al. (2020) list the following reasons for using the GEE platform: easy access to Landsat satellite data, the powerful computational capability of the service, and the ability to perform parallel processing of the data, among others.

Wang et al. (2019) use MongoDB, a document-oriented database, as intermediate temporary storage for data collected by sensors, which are subsequently transferred to a data warehouse. Sathiaraj et al. (2019) used the in-memory database Redis, whose data model is key-value, for the visualization system of the computational analyses performed, as it offers low latency when accessing the data.

Nóbrega et al. (2018) use PostgreSQL to store the data obtained by the collars placed on each sheep. They decided to use a relational database because their sensor collar network has several entities that can be efficiently modeled in this kind of database. Furthermore, they selected PostgreSQL among the available RDBMS options because it suits environments with system-critical data that require security and integrity mechanisms.
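To illustrate why a relational model fits a collar network with several related entities, the sketch below defines a small schema with foreign keys. It is not the schema of the cited work: the table names, columns, timestamps, and posture labels are hypothetical, and SQLite (from the Python standard library) stands in for PostgreSQL so that the example runs without a database server.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE sheep (
    sheep_id INTEGER PRIMARY KEY,
    tag      TEXT NOT NULL UNIQUE
);
CREATE TABLE collar (
    collar_id INTEGER PRIMARY KEY,
    sheep_id  INTEGER NOT NULL REFERENCES sheep(sheep_id)
);
CREATE TABLE reading (
    reading_id  INTEGER PRIMARY KEY,
    collar_id   INTEGER NOT NULL REFERENCES collar(collar_id),
    recorded_at TEXT NOT NULL,
    posture     TEXT NOT NULL
);
""")

conn.execute("INSERT INTO sheep (sheep_id, tag) VALUES (1, 'ewe-001')")
conn.execute("INSERT INTO collar (collar_id, sheep_id) VALUES (10, 1)")
conn.executemany(
    "INSERT INTO reading (collar_id, recorded_at, posture) VALUES (?, ?, ?)",
    [
        (10, "2024-01-01T08:00:00", "grazing"),
        (10, "2024-01-01T08:01:00", "grazing"),
        (10, "2024-01-01T08:02:00", "resting"),
    ],
)
conn.commit()


def posture_counts(connection):
    """Count readings per sheep and posture by joining the three entities."""
    query = """
        SELECT s.tag, r.posture, COUNT(*)
        FROM reading AS r
        JOIN collar AS c ON c.collar_id = r.collar_id
        JOIN sheep AS s ON s.sheep_id = c.sheep_id
        GROUP BY s.tag, r.posture
        ORDER BY s.tag, r.posture
    """
    return connection.execute(query).fetchall()


def rejects_orphan_reading(connection):
    """Return True if a reading for a non-existent collar is refused."""
    try:
        connection.execute(
            "INSERT INTO reading (collar_id, recorded_at, posture) "
            "VALUES (999, '2024-01-01T09:00:00', 'resting')"
        )
    except sqlite3.IntegrityError:
        return True
    return False


print(posture_counts(conn))
print(rejects_orphan_reading(conn))
```

The foreign-key constraints are an example of the integrity mechanisms mentioned above: a reading cannot be stored unless it refers to a registered collar.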

3 Analysis in Agriculture

3.1 Agricultural Big Data

Big Data is defined in four dimensions (4 Vs). The first V, volume, refers to the enormous amount of data being generated, stored, and processed. The second V, velocity, refers to the high speed of data transmission in interactions and the rates at which data are generated, collected, and exchanged. The third V, variety, refers to the data formats and structures (structured, semistructured, and unstructured) resulting from the heterogeneity of data sources (Sassi et al. 2019). The fourth V, veracity, refers to the ability to validate the quality of the data used in the analyses.

Apart from the “4 Vs”, another dimension of Big Data, its value, must also be considered. Value is obtained by analyzing data to extract hidden patterns, trends, and knowledge models through intelligent data analysis algorithms and techniques. Data science methods increase the value of data, providing a better understanding of phenomena and behaviors, optimizing processes, and improving discoveries by machines, companies, and scientists. Therefore, we cannot consider Big Data science without including data analytics and ML as critical steps to extract value among Big Data science strategies (Elshawi et al. 2018).

In practice, Big Data analytics tools enable data scientists to discover correlations and patterns by analyzing massive amounts of data from different sources. In recent years, Big Data science has become an essential modern discipline for data analytics (Elshawi et al. 2018). It is considered an amalgamation of classical disciplines such as statistics, artificial intelligence, mathematics, and computer science, with its subdisciplines including database systems, ML, and distributed systems (Haig 2020).

Big Data in agriculture refers to all the modern technology available combined with data analysis as a basis for making decisions based only on data (Sarker et al. 2019). The following typology will help us to understand the Big Data evolution (see Fig. 4).

Fig. 4
A block diagram relates the topology of Big Data evolution. The agricultural big data involves automated agriculture, enterprise agriculture, prescription agriculture, and precision agriculture.

Topologies in digital agriculture

Precision agriculture collects real-time data on farm elements such as crops, air, and soil to protect the environment while ensuring profits and sustainability (Micheni et al. 2022). Incorporating ML techniques in farming has advanced aspects such as crop and soil health, irrigation systems, crop disease identification, weed control, and recommended control measures. The adoption of a robotic farming system has a significant impact on crop production, efficiency, and sustainability. However, the success of precision farming is hampered by factors such as lack of training, low return on investment, high costs, and lack of Big Data analysis of precision farming.

Big Data has been used to improve various aspects of agriculture, such as knowledge about weather and climate change, land, animal research, crops, soil, weeds, food availability and security, biodiversity, farmer decision-making, insurance and farmer financing, and remote sensing (Kamilaris et al. 2017). It is also used to create platforms that enable supply chain actors to access high-quality products and processes; tools to improve yields and predict demand; and advice and guidance to farmers based on the responsiveness of their crops to fertilizers, leading to better fertilizer use. It has also led to the introduction of plant scanning equipment to track deliveries and enable retailers to monitor consumer purchases, improving product traceability throughout the supply chain (Wolfert et al. 2017).

Big Data has been used with other technologies such as ML, cloud-based platforms, image processing, modeling and simulation, statistical analysis, normalized difference vegetation index (NDVI), and geographic information systems (GIS) (Kamilaris et al. 2017).
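Of these, NDVI has a simple closed form: it is the normalized difference between near-infrared (NIR) and red reflectance, NDVI = (NIR - Red) / (NIR + Red). A minimal sketch follows; the reflectance values are hypothetical, and returning 0 for pixels with no signal is an assumption of this sketch rather than a standard convention.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    if nir + red == 0:
        return 0.0  # assumption: treat pixels without any signal as 0
    return (nir - red) / (nir + red)


# Hypothetical reflectance pairs (NIR, red) for three pixels.
pixels = [(0.5, 0.1), (0.3, 0.3), (0.1, 0.4)]
for nir, red in pixels:
    print(f"NIR={nir}, red={red} -> NDVI={ndvi(nir, red):.2f}")
```

Dense, healthy vegetation reflects strongly in the near-infrared and absorbs red light, so it produces values close to 1, whereas bare soil and water give values near or below 0.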

There are Big Data solutions for different areas of agriculture, such as farmer decision-making, crops, animal research, land, food availability and security, weather and climate change, and weeds (Cravero et al. 2022b).

For example, Boudriki Semlali and El Amrani (2021) used Big Data tools to monitor atmospheric composition. The system architecture contains the data source layer, ingest, storage using Hadoop, data management layer, infrastructure, and monitoring and security layer. In addition, they used data on pollutant gas emissions from other sources, such as agriculture, business, and transportation. As a result, the authors could continuously monitor the atmospheric composition by remote sensing. Figure 5 shows the complete process.

Fig. 5
A block diagram depicts the architecture of Big Data. It comprises the data sources, ingestion layer, visualization layer, Hadoop management layer, analytics engines, data warehouse, Hadoop storage layer, and monitoring layer.

Architecture of Big Data for atmospheric composition monitoring

Another example is Alex and Kanavalli (2019), who developed a Big Data system that predicts whether fertilizers will cause disease in crops. They used data such as soil moisture, average rainfall, and soil nutrients. The authors also used data such as phosphorus (P), nitrogen (N), magnesium (Mg), calcium (Ca), and sulfur (S). The Big Data process starts with data enrichment, followed by data clustering, so the data can be classified and analyzed to deliver recommendations. Finally, the Hadoop ecosystem was used to store and process the data analyzed with ML. Figure 6 depicts the complete process.

Fig. 6
An illustration. The expert system in Big Data architecture consists of three phases, data enrichment, clustering, and classification. The system is obtained from the data storage from the sensor data collection.

Big Data architecture for fertilizer management and yield prediction

Big Data enables data scientists and farmers to understand agricultural behavior, such as climate, land, soil, crops, animal production, weeds, food safety, biodiversity, remote sensing, farmer decision-making, insurance, financing, and climate change. It also enables the development of supply chain platforms, which allow players to access high-quality products, processes, and tools capable of improving yields, predicting demand, and targeting farmers based on crop needs, such as appropriate fertilizer use.

3.2 Agricultural Big Data Technologies

The technologies used for the implementation of Big Data and ML systems in agriculture are presented in Fig. 7. It can be seen that the most frequently identified technology was Apache Spark (Cravero et al. 2022b).

Fig. 7
A word cloud where the words with the most occurrence are Apache, HDFS, Spark, JavaScript, Python, and Cloud.

Main technologies used in Agricultural Big Data and ML

For the collection of agricultural data, the most common technologies are sensors and satellite imagery. The former are used to collect data at the location of interest on crops, animals, weather, or soil properties (Donzia and Kim 2020; Priya and Ramesh 2020; Nóbrega et al. 2018). Satellite imagery is used for monitoring land and crops over large areas (Amani et al. 2020; Gumma et al. 2020; Sitokonstantinou et al. 2020). Data are obtained through satellites or services from external providers, such as Google Earth Engine (GEE), global positioning systems (GPS), Sentinel satellites, Landsat satellites, or Google Maps.

In the implementation of Big Data systems, the most used file system is the Hadoop Distributed File System (HDFS), because it allows datasets to be split, stored in a distributed way across several nodes of a cluster, and processed in parallel (Sitokonstantinou et al. 2020). Most of the implemented clusters were configured with the various programs provided through the Apache Hadoop framework. Among these, Apache Hive and Apache Kafka stand out. Apache Hive is used to configure data warehouses that streamline working with large datasets stored in distributed units (Wang et al. 2019; Pandya et al. 2020). Apache Kafka is used to transmit information or messages to different nodes of the designed Big Data architecture (Pandya et al. 2020; Donzia and Kim 2020).

The technology most often identified in the implementation of ML models is the Python programming language (Balducci et al. 2018; Gumma et al. 2020; Fenu and Malloci 2019). JavaScript also stands out (see Fig. 7), especially together with libraries such as D3 for data visualization (Sathiaraj et al. 2019), Leaflet.js for displaying maps (Doshi et al. 2018), and React for building interactive user interfaces (Sathiaraj et al. 2019).

Wang et al. (2019) proposed and designed a Big Data system for agriculture based on data collection, storage, analysis, and application. To collect pear tree growth data (air temperature, soil moisture, light intensity, etc.), a high-precision wireless sensor network is used, which sends the collected data via TCP to traditional databases (MySQL, MongoDB, etc.). These databases store the data temporarily and serve as data sources for the overall Big Data system. Data synchronization software such as NiFi, Sqoop, or Flume is used to synchronize these data sources with the HDFS cluster responsible for storing all the data together. SparkSQL reads, filters, and stores data from the HDFS cluster in Apache Hive and Apache HBase. The former is employed for data used for analysis, and the latter for data monitoring and visualization of data statistics. Apache Dubbo is used for running farmer management services in a distributed manner (Cravero et al. 2022a).

3.3 Machine Learning in Agriculture

ML is a highly interdisciplinary field based on different areas such as artificial intelligence, optimization theory, information theory, statistics, cognitive science, optimal control, and many other scientific, engineering, and mathematical disciplines (Cherkassky and Mulier 2007). ML has covered almost all science domains, impacting society significantly (Rudin and Wagstaff 2014). It has been used in various problems, including recommendation controllers, computer science and data mining, recognition systems, and autonomous control systems (Qiu et al. 2016). In general, ML is used to optimize the performance of a task through mining past examples or experiences, as it can generate efficient relationships concerning data inputs and reconstruct a knowledge schema.

ML has been used to solve different problems in agriculture, such as crop management, including yield prediction, disease detection, weed detection, crop quality, and species recognition; livestock management, including animal welfare and livestock production; water management; and soil management (Liakos et al. 2018; Benos et al. 2021; Bal and Kayaalp 2021).

An example of its use is in precise detection, as together with sensors, it allows accurate detection and identification of weeds without causing environmental problems or side effects. ML for weed detection has led to the development of tools and robots to destroy weeds, minimizing the need for herbicides (Liakos et al. 2018). In addition, accurate detection and classification of crop quality characteristics have increased the value of products and reduced waste.

The increased research interest in ML in agriculture is a consequence of several factors: the considerable advances in IT systems in agriculture; the vital need to increase the efficiency of farming practices while reducing the environmental burden; and the need for reliable measurements with the handling of large volumes of data (Benos et al. 2021; Bal and Kayaalp 2021).

ML is used in conjunction with Big Data, as it allows analyzing a volume of data that is generated after processing and filtering data coming from different heterogeneous sources. Agricultural Big Data has technologies that allow ML algorithms to perform their work. According to Cravero et al. (2022b), the most commonly used ML techniques in Agricultural Big Data are Neural Networks (NNs), Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT). Figure 8 shows a list of ML techniques and the number of times they have been used in Big Data in the last 5 years.

Fig. 8
A horizontal bar chart plots the ML techniques in Big data for agriculture. The values for the neural network are the maximum, followed by the values for the random forest and support vector machine.

ML techniques in Agricultural Big Data

Some examples of their use are listed below.

3.3.1 Neural Networks

NNs are an excellent choice for working with large datasets because they adapt flexibly to them, reducing the prediction error by adjusting the weights and biases of each neuron based on the training data (Priya and Ramesh 2020). Saggi and Jain (2022) implemented an NN and compared its performance with that of other ML techniques. The NN was the best-performing technique, avoiding model overfitting and demonstrating excellent capabilities for estimating daily crop evapotranspiration.
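The idea of reducing the error by adjusting weights and biases can be shown with the smallest possible case. The sketch below trains a single linear neuron by gradient descent on the squared error; it is an illustration of the update rule only, not the network used in any of the cited studies, and the training samples, learning rate, and number of epochs are hypothetical.

```python
def train_neuron(samples, learning_rate=0.05, epochs=2000):
    """Fit prediction = weight * x + bias by stochastic gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            prediction = weight * x + bias
            error = prediction - target
            # The gradient of 0.5 * error**2 is error * x for the weight
            # and error for the bias; step against the gradient.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Hypothetical samples generated from target = 2 * x + 1.
SAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_neuron(SAMPLES)
print(round(weight, 3), round(bias, 3))
```

A full NN stacks many such neurons with nonlinear activations and propagates the error backward through the layers, but each weight is still adjusted with the same kind of step.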

Doshi et al. (2018) used NNs for automatic crop recommendation due to their built-in support for multilabel classification. In this task, the technique performed well, with 91% classification accuracy. Shelestov et al. (2020) found that the most sensitive parameters for the classification accuracy of an NN are the number of hidden neurons and the regularization of alpha coefficients.

3.3.2 Random Forest

Some of the applications of RF are crop prediction, crop yield under adverse conditions, identification of climatic variables, and analysis of agriculture-related problems such as nitrogen emissions or drought prediction (Priya and Ramesh 2020). Furthermore, RF is ideal for working with massive datasets, as it needs less time for data preprocessing, is proficient in global time complexity, and works well with sparse datasets (Priya and Ramesh 2020).

Doshi et al. (2018) implement RF for crop recommendation due to its built-in support for multiple-label classification (MLC), highlighting that this technique is effective for handling missing values and is resistant to model overfitting. The latter feature is one of the reasons for its implementation in the classification of South Asian croplands (Gumma et al. 2020).
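Two mechanisms commonly credited with the resistance of RF to overfitting are bootstrap sampling and majority voting over many trees. The sketch below is a deliberately simplified illustration and not the implementation used in the cited works: each "tree" is a one-split stump whose threshold is the feature mean, and the rainfall and temperature values and crop labels are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature):
    """One-split tree: threshold at the feature mean, majority label per side."""
    threshold = sum(features[feature] for features, _ in rows) / len(rows)
    left = [label for features, label in rows if features[feature] <= threshold]
    right = [label for features, label in rows if features[feature] > threshold]
    fallback = majority([label for _, label in rows])
    left_label = majority(left) if left else fallback
    right_label = majority(right) if right else fallback
    return lambda features: left_label if features[feature] <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump(sample, feature))
    return forest


def predict(forest, features):
    """Majority vote over the predictions of all trees."""
    return majority([tree(features) for tree in forest])


# Hypothetical (rainfall in mm, temperature in degrees Celsius) -> crop.
ROWS = [
    ((200, 30), "rice"), ((220, 28), "rice"), ((210, 31), "rice"),
    ((60, 18), "wheat"), ((50, 20), "wheat"), ((70, 17), "wheat"),
]
forest = fit_forest(ROWS)
print(predict(forest, (250, 35)), predict(forest, (30, 10)))
```

Because every tree sees a different resample and a different feature, individual errors tend not to coincide, and the vote averages them out.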

3.3.3 Support Vector Machine

SVM is suitable for handling small datasets that do not contain too many outliers, and its performance increases when the dimensional space of the data is ample, even if the number of attributes is lower (Priya and Ramesh 2020).

In Nóbrega et al. (2018), different ML algorithms, including SVM, are compared to detect the conditions of an animal concerning posture data. Of the analyzed algorithms, SVM was the one that presented the worst performance; however, its results do not differ noticeably from the rest of the algorithms, and all of them had over 95% accuracy. A similar case was observed in Yang et al. (2018), where after comparing different ML techniques for predicting the growth state of a plant, it was observed that SVM had the lowest accuracy, although this was above 90%. In both cases, SVM was not the most suitable technique for the tasks performed, but it demonstrated a good level of accuracy.

Shelestov et al. (2020) found that the most sensitive parameters of SVM are gamma, C, and the type of kernel used. They performed measurements on the latter using Radial Basis Function (RBF) and sigmoid kernels and found RBF to be the most appropriate for crop classification tasks.
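The role of gamma is easiest to see in the kernel functions themselves. The sketch below writes out the standard RBF and sigmoid kernels in plain Python; the input vectors and parameter values are hypothetical, the offset is named `coef0` as in scikit-learn, and the C parameter, which belongs to the SVM optimization rather than to the kernel, is not shown.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * dot(x, y) + coef0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


# Hypothetical feature vectors for two field samples.
a, b = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.1, 0.5, 2.0):
    print(gamma, round(rbf_kernel(a, b, gamma), 4))
```

With the RBF kernel, identical points always have similarity 1, and a larger gamma makes the similarity decay faster with distance, so each training sample influences a smaller neighborhood.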

3.3.4 Decision Tree

DT is efficient in terms of computation and scalability. Moreover, its performance increases when the data are uncorrelated (Priya and Ramesh 2020). The efficiency of this technique is shown in Nóbrega et al. (2018), who compared different ML techniques for classifying an animal’s posture using data collected by an IoT collar. Of the compared techniques, the authors highlighted DT due to the low computational time required for model training and the easy subsequent interpretation of the model; it also presented one of the best values of accuracy and area under the curve (AUC) of the compared techniques.

Similarly, Yang et al. (2018) investigated the prediction of the growth status of a plant using different ML techniques. The authors concluded that DT was the best of the compared algorithms because of its low time consumption and high accuracy.
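To illustrate why a DT is easy to interpret, the sketch below finds a single threshold split by minimizing the weighted Gini impurity. Gini impurity is a common splitting criterion and is assumed here; the cited studies do not state which criterion they used. The feature name `neck_angle_deg`, the posture labels, and the readings are hypothetical and invented for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint between neighbours
    return best

# Hypothetical collar readings: neck angle in degrees and observed posture.
neck_angle_deg = [-40, -35, -30, -10, 5, 10, 20, 25]
posture = ["grazing", "grazing", "grazing", "grazing",
           "standing", "standing", "standing", "standing"]

threshold, impurity = best_threshold(neck_angle_deg, posture)
print(f"if neck_angle_deg <= {threshold}: grazing, else: standing "
      f"(weighted Gini = {impurity})")
```

The learned rule is a single readable comparison, which is the kind of interpretability the authors point to; a full tree repeats this search recursively on each side of the split.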

4 Challenges and Future Work

In general, data analysis for agriculture brings many benefits (Chergui et al. 2020) such as:

  • Providing farmers with helpful information and helping them decide how much, when, and where to apply nutrients, water, seeds, fertilizers, and other agricultural chemicals and inputs.

  • Protecting the environment and helping to obtain healthy products, as it allows varying the amount of inputs (irrigation, fertilizers, and pesticides) and even seeds used for crop production, and applying those inputs in exact amounts in each field.

  • Giving farmers access, through data-driven management, to sophisticated management solutions against climate change and other environmental challenges and natural phenomena. Thanks to these solutions, farmers can continuously monitor crop health and receive timely alerts about potential pests, disease problems, or climate change.

  • Offering farmers, from a marketing point of view, advanced models that provide information about the market and about which products could bring them the most profit.

According to Basnet and Bang (2018), there are still many challenges for data analysis in agriculture. The authors explain that improved technology will add more precision, accuracy, speed, and reliability to data analysis and reduce costs. It is also crucial to standardize technology in order to improve communication between agricultural equipment, and to support research and open-source projects that improve the quality of technological solutions. The authors also explain that user-friendly technical solutions are required and that they must be adapted to local contexts and needs. Finally, it is essential to improve the understanding of Big Data by systematically promoting the concept, its practical use, its value, and the need for multidisciplinary work, thereby expanding education and awareness of Big Data in data analysis.

Lassoued et al. (2021) analyzed the impact and potential of Big Data in agriculture. They identified several challenges related to data sources because not all value chain segments capture data in the same way. For example, they noted that there is no standard for data capture, which makes it difficult to harmonize and compile data from various sources. Another major obstacle identified is data governance. While most experts surveyed were willing to share their data under certain conditions, many expressed concerns about data privacy, security, cybercrime, and intellectual property protection.

On the other hand, Bhat and Huang (2021) examined the challenges of data collection and analysis. The combination of data from various sources raises concerns about the quality of information and its fusion. In addition, the volume of information collected raises safety and security concerns. The datasets collected are vast and complex, which makes them challenging to handle with standard intelligent analysis procedures; these methods often do not work well when applied to agricultural data. The authors expect scalable and versatile methods to adapt to large amounts of information. Weersink et al. (2018) explained that data must be collected consistently and comply with protocols that allow them to be pooled on centralized servers. These servers must be protected from cyberattacks while masking the identity of the operations managers.

Regarding the use of ML in Big Data, a major challenge is coping with a large volume of data. For example, the SVM algorithm has a training time complexity of O(n³) and a space complexity of O(n²), where n is the number of training samples. An increase in n drastically affects the time and memory required to train this algorithm, and training may even become computationally infeasible for large datasets.
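The effect of these growth rates can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that the quadratic memory term corresponds to a dense n × n kernel matrix stored as 8-byte floats, and it reports only the relative growth in training time under cubic scaling. The sample counts are arbitrary illustrations, and practical implementations may behave differently from this worst-case estimate.

```python
BYTES_PER_FLOAT64 = 8

def kernel_matrix_bytes(n_samples):
    """Memory for a dense n x n kernel matrix of 8-byte floats (assumption)."""
    return n_samples ** 2 * BYTES_PER_FLOAT64

def relative_training_time(n_samples, baseline):
    """Cubic scaling factor relative to a baseline sample count."""
    return (n_samples / baseline) ** 3

baseline = 1_000
for n in (1_000, 10_000, 100_000, 1_000_000):
    gib = kernel_matrix_bytes(n) / 2 ** 30
    factor = relative_training_time(n, baseline)
    print(f"n={n:>9,}  kernel matrix ~ {gib:,.2f} GiB  time x{factor:,.0f}")
```

Under these assumptions, 10,000 samples need about 0.75 GiB for the kernel matrix, whereas 1,000,000 samples would need roughly 7,450 GiB, together with a billion-fold increase in training time relative to 1,000 samples.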

A common assumption of ML is that algorithms can learn better with more data and provide more accurate results. However, massive datasets impose several challenges because traditional algorithms were not designed to meet such requirements (Cravero et al. 2022b).

In Agricultural Big Data, a combination of technologies is required, as data from experts, videos, and satellite images will be processed in batches. On the other hand, data from social networks and sensors will be processed by streaming. In the case of cloud-based technologies, there are several tools for ML use: Microsoft Azure Machine Learning, now part of Cortana Intelligence Suite; Google Cloud Machine Learning Platform; Amazon Machine Learning; and IBM Watson Analytics (Yang et al. 2017). Established vendors offer these services, which provide scalability and integration with other services and platforms.
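The distinction between the two processing modes can be sketched in a few lines of Python. The batch function sees a complete dataset at once, whereas the streaming class updates its result as each reading arrives and never stores the history. The sensor values and all names are hypothetical; production systems would rely on dedicated batch and stream-processing platforms rather than this toy code.

```python
def batch_mean(readings):
    """Batch mode: the full dataset is available before processing starts."""
    return sum(readings) / len(readings)

class StreamingMean:
    """Streaming mode: update the mean incrementally, one reading at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: shift the estimate toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean

# Hypothetical soil-moisture readings (percent); values are made up.
soil_moisture = [31.0, 29.5, 30.2, 28.8, 27.9]

stream = StreamingMean()
for reading in soil_moisture:
    current = stream.update(reading)  # a result is available after every event

print(batch_mean(soil_moisture), current)
```

Both approaches arrive at the same mean, but only the streaming version can report an up-to-date value while data are still arriving, which is what sensor and social network feeds require.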

Other challenges include understanding the statistical characteristics of the data before applying algorithms and the ability to work with larger datasets (Sukumar 2014). In addition, specific knowledge is required for certain problems in agriculture, such as increased production, quality improvement, and climate change, among others.

The future of Agricultural Big Data development and ML use is promising. Flexible Big Data architectures that consider alternative ML techniques depending on the conditions of the data generated will have a growing effect, thanks to technologies that are already developed and constantly evolving. Likewise, the use of cloud computing will increase as new professionals are trained and network speeds improve. Cloud computing and other tools will include more alternative ML techniques, which will facilitate flexibility.

As for ML techniques, the use of DL and the other techniques mentioned in this chapter will increase; so far, these have been adapted to specific contexts because of problems with data volume, processing speed, variability, and veracity. However, these problems can be solved by classifying data storage through a Data Lake.

Future research should focus on implementing appropriate decision support systems for accurate crop decisions, natural resource management, and climate change mitigation.