Abstract
An immense amount of agriculture-related data is contained in the vast scholarly literature. To obtain as much relevant information as possible from this data, we need to extract its context and meaning. Semantic web technology can provide context and meaning to the data. Named entity recognition (NER) systems can help to extract the named entities and the relations between them. These entities and relations can then be used to build a knowledge graph (KG), which can be stored using the Resource Description Framework (RDF) and queried with SPARQL. In this paper, we propose an NER dataset that contains a total of thirty-six types of entities and nine types of relations, which can be used to build a KG.
1 Introduction
Research papers on agriculture contain information about the latest advances in the field, yet this information is not always easily accessible to practitioners such as scientists and farmers, mainly because the number of papers is huge and the information is not available in a structured format. Agricultural industries, researchers, food processing companies, and many other organizations need to extract entities such as crop names, pesticides, and factors that affect plant growth, together with their relationships, to make useful and strategic decisions. A named entity recognition (NER) system helps to extract knowledge entities from unstructured sources [2]. There are a few works on NER in agriculture, some of which, like [1, 4], apply deep learning. A few datasets are available to train NER systems. Malarkodi et al. [7] have proposed nineteen entity types in the agriculture domain, but these do not cover many important aspects of agriculture. In addition, their corpus is not publicly available. Lun et al. [6] focus on four entity types, namely Crop, Disease, Pest, and Drug, but their data is limited to Chinese agricultural websites. Gangadharan et al. [3] have worked with only three types of entities, namely Disease, Soil, and Fertiliser, using only Indian agricultural websites. Liu et al. [5] have worked with six types of entities, namely Organism, Trait, Method/Equipment, Chemical, Gene, and Environment, along with a Miscellaneous category, using article abstracts from ten typical horticultural journals. In contrast to the above works, our corpus is an annotated collection of abstracts from agriculture research papers, and our set of entity types and relations is significantly larger. In this paper, we propose thirty-six entity types and nine relations between the entities. Our contributions in this paper are as follows:
1. We introduce a fine-grained tag set comprising 36 useful entities in the agricultural domain.
2. We introduce 9 relations between the entities, including symmetric and asymmetric relations.
3. We introduce a publicly available fully annotated corpus with the above tags.
The corpus is publicly available on GitHub. The rest of our paper is organized as follows. In Sect. 2, we propose a taxonomy for the entities and relations. We provide dataset statistics in Sect. 3. In Sect. 4, we apply a machine learning model for NER on this dataset. We conclude in Sect. 5.
2 Proposed Taxonomy
Our dataset is built from abstracts of research papers in agriculture. After analyzing the abstracts, we have developed a list of entity types and relations to cover most of the important knowledge aspects of the papers. The proposed tag set contains thirty-six named entities that we believe can help in research in the agriculture domain. The named entity types are Agri_Pollution, Agri_Process, Agri_Waste, Agri_Method, Chemical, Citation, Crop, Date_and_Time, Disease, Duration, Event, Field_Area, Food_Item, Fruit, Humidity, Location, ML_Model, Money, Natural_Disaster, Natural_Resource, Nutrient, Organism, Organization, Other, Other_Quantity, Person, Policy, Quantity, Rainfall, Season, Soil, Technology, Temp, Treatment, Vegetable, Weather. The terms are self-explanatory.
We define nine relations to form meaningful connections between the entities: three symmetric relation types (Coreference_Of, Conjunction, Synonym_Of) and six asymmetric relation types (Caused_By, Helps_In, Includes, Originated_From, Used_For, Seasonal). A detailed description of the entity types and relations is available in our GitHub repository.
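The distinction between symmetric and asymmetric relations matters once annotated relations are turned into KG edges: a symmetric relation implies an edge in both directions, whereas an asymmetric one does not. The following minimal Python sketch illustrates this under stated assumptions. The relation names come from the taxonomy above, but the `Triple` type, the helper functions, the `http://example.org/agriner/` namespace, and the example mentions are hypothetical and are not taken from the corpus or its repository.

```python
from typing import NamedTuple

# Relation names as listed in the taxonomy.
SYMMETRIC = {"Coreference_Of", "Conjunction", "Synonym_Of"}
ASYMMETRIC = {"Caused_By", "Helps_In", "Includes",
              "Originated_From", "Used_For", "Seasonal"}


class Triple(NamedTuple):
    """A (head, relation, tail) edge; a hypothetical helper type."""
    head: str
    relation: str
    tail: str


def expand_symmetric(triples):
    """Return the triples plus the reverse edge of every symmetric relation."""
    expanded = set()
    for t in triples:
        if t.relation not in SYMMETRIC | ASYMMETRIC:
            raise ValueError(f"unknown relation: {t.relation}")
        expanded.add(t)
        if t.relation in SYMMETRIC:
            expanded.add(Triple(t.tail, t.relation, t.head))
    return expanded


def to_ntriples(triples, base="http://example.org/agriner/"):
    """Serialize triples as N-Triples lines under an assumed namespace."""
    def iri(name):
        # Simplification: only spaces are escaped, which is enough here.
        return f"<{base}{name.replace(' ', '_')}>"
    return sorted(f"{iri(t.head)} {iri(t.relation)} {iri(t.tail)} ."
                  for t in triples)


# Illustrative mentions only; they are not drawn from the dataset.
example = [
    Triple("paddy", "Synonym_Of", "rice"),
    Triple("leaf blight", "Caused_By", "Xanthomonas oryzae"),
]
graph = expand_symmetric(example)
lines = to_ntriples(graph)
for line in lines:
    print(line)
```

Storing only one direction of a symmetric relation and expanding it at load time keeps the annotation effort low, while the expanded form lets a query match either direction.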
3 Dataset Statistics
The quality of the dataset influences the knowledge graph constructed and the machine learning models trained on it. We have hand-picked the abstracts of 180 papers from several reputed agricultural journals, such as the Asian Journal of Agricultural and Food Sciences (AJAFS), The Indian Journal of Agricultural Sciences, and a few journals from IEEE and Springer Nature. Based on an analysis of these abstracts and of recent trends in agriculture, such as those discussed in [8, 9], we have decided on thirty-six entity types and nine types of relations among the entities. Table 1 summarizes the occurrences of each annotated entity type in the proposed dataset as percentages.
We have used the freely available brat tool for annotation. One of the challenges was entity class imbalance. To mitigate this problem, we first counted the mentions of each entity type. Then, we added more data to the corpus to increase the counts of the least frequent entity types. In total, we have 14,307 word tokens and 1,348 entity mentions. We have partitioned the dataset in a 70:30 ratio, with 70% of the data for training and 30% for testing.
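The counting and partitioning steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the corpus: it assumes annotations in brat's standoff (.ann) format, where text-bound entity lines have IDs starting with "T", and it assumes the split is made at the level of whole abstracts with a fixed random seed. The function names, the sample annotation, and the seed are hypothetical.

```python
import random
from collections import Counter


def entity_type_counts(ann_text):
    """Count entity mentions per type in brat standoff (.ann) content."""
    counts = Counter()
    for line in ann_text.splitlines():
        # Text-bound (entity) annotations have IDs starting with "T";
        # relation lines ("R...") and others are skipped.
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        # The second field looks like "Crop 0 4": type, then offsets.
        entity_type = fields[1].split(" ", 1)[0]
        counts[entity_type] += 1
    return counts


def percentages(counts):
    """Convert raw mention counts to percentages of all mentions."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


def split_documents(doc_ids, train_fraction=0.7, seed=0):
    """Shuffle document IDs reproducibly and split them train/test."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical annotation content, not taken from the corpus.
SAMPLE_ANN = (
    "T1\tCrop 0 4\tRice\n"
    "T2\tDisease 20 25\tblast\n"
    "T3\tCrop 40 45\twheat\n"
    "R1\tCaused_By Arg1:T2 Arg2:T1\n"
)

counts = entity_type_counts(SAMPLE_ANN)
shares = percentages(counts)
train_ids, test_ids = split_documents(range(180))
print(counts.most_common(), len(train_ids), len(test_ids))
```

Sorting the counts in ascending order shows which entity types are rarest and therefore where additional abstracts would help most.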
4 Machine Learning-Based Extraction of Named Entities
To provide a baseline for an automatic NER system for the dataset, we have trained spaCy with the entities we have labeled. spaCy is a free, open-source library for natural language processing in Python. spaCy v3.0 provides a transformer-based pipeline, which can be trained on custom data. We first initialized the spaCy pipeline with the tok2vec and ner components and then trained the model for several epochs with our custom entities. The resulting model can recognize entities in unstructured data from the agricultural domain.
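Training an NER model on custom entities requires the character-offset annotations to be aligned with tokens. spaCy performs this alignment internally, so the sketch below is not part of the pipeline described above; it only illustrates the idea using the common BIO tagging scheme with whitespace tokenization, which is a simplification of a real tokenizer. The function names, the example sentence, and the labels assigned to it are hypothetical.

```python
import re


def find_span(text, phrase, label):
    """Locate the first occurrence of `phrase` and return (start, end, label)."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def to_bio(text, spans):
    """Convert character-offset entity spans to token-level BIO tags.

    `spans` is a list of (start, end, label) with `end` exclusive.
    Tokens are whitespace-delimited, so trailing punctuation stays
    attached to the preceding word.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                prefix = "B-" if tok_start <= start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Illustrative sentence and labels; not taken from the corpus.
text = "Rice blast reduces yield in the kharif season."
spans = [
    find_span(text, "Rice", "Crop"),
    find_span(text, "blast", "Disease"),
    find_span(text, "kharif season", "Season"),
]
tokens, tags = to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```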
Table 2 displays the classification metrics and the results. For simplicity, we report values to two decimal places. We have excluded the results for some of the entity types because they occur very rarely in the test data; for the same reason, the support for some of the entity types in Table 2 is low. Because some entity classes are small, it is conceivable that not all of the entity types were observed while predicting on the test dataset. Figure 1 displays some parts of the knowledge graph built using the proposed dataset.
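For readers unfamiliar with the metrics, the sketch below shows one common way to compute per-type precision, recall, F1, and support for NER. It assumes exact span matching, that is, a prediction counts as correct only if its boundaries and label both match a gold mention; the text above does not state which matching criterion was used, so this is an assumption. The function name and the toy data are hypothetical.

```python
from collections import Counter


def span_metrics(gold, predicted):
    """Per-label precision, recall, F1 and support with exact span matching.

    `gold` and `predicted` are sets of (doc_id, start, end, label) tuples.
    Support is the number of gold mentions of the label.
    """
    true_pos = Counter(label for *_, label in gold & predicted)
    gold_n = Counter(label for *_, label in gold)
    pred_n = Counter(label for *_, label in predicted)
    report = {}
    for label in sorted(set(gold_n) | set(pred_n)):
        p = true_pos[label] / pred_n[label] if pred_n[label] else 0.0
        r = true_pos[label] / gold_n[label] if gold_n[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        report[label] = {
            "precision": round(p, 2),
            "recall": round(r, 2),
            "f1": round(f1, 2),
            "support": gold_n[label],
        }
    return report


# Toy data: one Crop prediction has a wrong end offset, so it counts as
# both a false positive and a false negative.
gold = {("d1", 0, 4, "Crop"), ("d1", 5, 10, "Disease"), ("d2", 3, 8, "Crop")}
predicted = {("d1", 0, 4, "Crop"), ("d1", 5, 10, "Disease"), ("d2", 3, 9, "Crop")}

report = span_metrics(gold, predicted)
for label, scores in report.items():
    print(label, scores)
```

With very low support, a single error changes the score substantially, which is why per-type results for rare entity types are hard to interpret.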
5 Conclusion
In this paper, we have introduced a total of thirty-six entity types, together with three symmetric and six asymmetric relations, extracted from several agricultural research papers. The NER dataset is organized into a knowledge graph. In the future, we intend to use semantic web technologies to make the graph semantically richer by linking it to other relevant knowledge graphs. We hope that better ML models will be built to improve the classification performance, and that our dataset will inform and motivate further research on the construction and application of agricultural knowledge graphs.
References
1. Devi, M., Dua, M.: ADANS: an agriculture domain question answering system using ontologies. In: Proceedings of the 2017 International Conference on Computing, Communication and Automation (ICCCA), pp. 122–127. IEEE (2017)
2. Drury, B., Fernandes, R., Moura, M.F., de Andrade Lopes, A.: A survey of semantic web technology for agriculture. Inf. Process. Agric. 6(4), 487–501 (2019)
3. Gangadharan, V., Gupta, D.: Recognizing named entities in agriculture documents using LDA based topic modelling techniques. Procedia Comput. Sci. 171, 1337–1345 (2020)
4. Li, W., Chen, P., Wang, B., Xie, C.: Automatic localization and count of agricultural crop pests based on an improved deep learning pipeline. Sci. Rep. 9(1), 7024 (2019)
5. Liu, Z., Luo, M., Yang, H., Liu, X.: Named entity recognition for the horticultural domain. J. Phys. Conf. Ser. 1631, 012016 (2020). IOP Publishing
6. Lun, Z., Hui, Z., et al.: Research on agricultural named entity recognition based on pre train BERT. Acad. J. Eng. Technol. Sci. 5(4), 34–42 (2022)
7. Malarkodi, C., Lex, E., Devi, S.L.: Named entity recognition for the agricultural domain. Res. Comput. Sci. 117(1), 121–132 (2016)
8. Sinha, B.B., Dhanalakshmi, R.: Recent advancements and challenges of internet of things in smart agriculture: a survey. Futur. Gener. Comput. Syst. 126, 169–184 (2022)
9. Verma, K.K., et al.: Recent trends in nano-fertilizers for sustainable agriculture under climate change for global food security. Nanomaterials 12(1), 173 (2022)
Acknowledgement
This work was carried out as part of the project “Extraction, Organization and Query of Scholarly Information”, sponsored by the Science & Engineering Research Board, Govt. of India.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
De, S., Sanyal, D.K., Mukherjee, I. (2023). AgriNER: An NER Dataset of Agricultural Entities for the Semantic Web. In: Pesquita, C., et al. The Semantic Web: ESWC 2023 Satellite Events. ESWC 2023. Lecture Notes in Computer Science, vol 13998. Springer, Cham. https://doi.org/10.1007/978-3-031-43458-7_11
DOI: https://doi.org/10.1007/978-3-031-43458-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43457-0
Online ISBN: 978-3-031-43458-7
eBook Packages: Computer Science, Computer Science (R0)