
1 Introduction

Pesticides are extensively used in agriculture for the control of weeds and plant diseases, and may remain in residual amounts in fruits, vegetables, grains, and water, to name a few. They are xenobiotic compounds for living organisms, and their toxicity is not due to a single molecular event or interaction, but rather to a set of occurrences, starting with pesticide exposure and culminating in the expression of one or more toxic endpoints. These events include absorption, distribution, biotransformation, distribution of metabolites, interaction with cellular macromolecules and excretion [1]. Biotransformation may result in the formation of less toxic and/or more toxic metabolites, while the various other processes determine the balance between toxic and nontoxic outcomes [2]. Pesticides can be absorbed by oral, dermal, nasal and/or ocular exposure. Human exposure can be dietary, recreational and/or occupational, and toxicity can be acute or chronic. Aggregate exposure and risk assessment involve multiple pathways and routes, including the potential for pesticide residues in food and drinking water, as well as residues from pesticide use in residential and non-occupational environments [3]. To ensure the safety of the food supply for human consumption, Maximum Contaminant Levels (MCLs) set the legal limits for the amount of pesticides allowed in food and drinking water. This is related to the Acceptable Daily Intake (ADI), defined as the amount of a chemical that can be consumed safely every day [4].

The environmental fate and toxicity of a pesticide depend on its physical and chemical characteristics, the soil composition, soil adsorption, and the pesticide residues found in the different soil compartments. The human hazard is determined by the pesticide properties, the exposure time and the individual’s susceptibility, which affect the magnitude of these processes and the final fate and toxicity of the pesticide [5]. Indeed, agricultural pesticides are incorporated into the organism by different routes and can be stored and distributed in different tissues, leading to an internal concentration that can induce alterations, adverse effects and/or diseases. Often, human exposure to pesticides has been evaluated only by human biomonitoring, i.e., by measuring levels in matrices such as blood and urine. However, the concentrations measured might not relate to the toxic effect [5, 6].

Recent works established pesticide impact and toxicity based on chemical properties, environmental fate and exposure considerations [7–9]. However, those methodologies are not able to deal with incomplete data, information or knowledge. Indeed, for the development of intelligent decision support systems aimed at integrated pesticide toxicological risk assessment, it is necessary to consider different conditions with intricate relations among them. Thus, the present work reports the foundations of a computational framework that uses knowledge representation and reasoning techniques to set the structure of the information and the associated inference mechanisms, i.e., it is centered on a Proof Theoretical approach to Logic Programming (LP) [10], complemented with a computational framework based on Artificial Neural Networks (ANNs) [11].

2 Knowledge Representation and Reasoning

Many approaches to knowledge representation and reasoning have been proposed using the Logic Programming (LP) paradigm, namely in the area of Model Theory [12, 13] and Proof Theory [10, 14]. In the present work the Proof Theoretical approach is followed, in terms of an extension to the LP language. An Extended Logic Program is a finite set of clauses, given in the form:

$$ \begin{array}{l} p \leftarrow p_{1} , \cdots ,p_{n} ,\;not\,q_{1} , \cdots ,not\,q_{m} \\ ?\left( {p_{1} , \cdots ,p_{n} ,\;not\,q_{1} , \cdots ,not\,q_{m} } \right)\quad \left( {n,m \ge 0} \right) \end{array} $$

where “?” is a domain atom denoting falsity, the \( p_{i} \), \( q_{j} \), and \( p \) are classical ground literals, i.e., either positive atoms or atoms preceded by the classical negation sign \( \neg \) [10], which stands for a strong declaration that speaks for itself, while not denotes negation-by-failure, or in other words, a failure in proving a given statement, once it was not declared explicitly. Under this formalism, every program is associated with a set of abducibles [12, 13], given here in the form of exceptions to the extensions of the predicates that make the program, i.e., clauses of the form:

$$ exception_{{p_{1} }} , \cdots ,exception_{{p_{j} }} \left( {0 \le j \le k} \right), \quad \quad \left( {k\,being\,an\,integer\,number} \right) $$

that stand for information or knowledge that cannot be ruled out. On the other hand, clauses of the type:

$$ ?\left( {p_{1} , \cdots ,p_{n} ,not\,q_{1} , \cdots ,not\,q_{m} } \right)\left( {n,m \ge 0} \right) $$

also named invariants or restrictions to comply with the universe of discourse, set the context under which it may be understood. The term scoring value stands for the relative weight of the extension of a specific predicate with respect to the extensions of its peers, which make the inclusive or global program.
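For illustration purposes only (an example of ours, in the spirit of the formalism just described, not drawn from the case study below), the unknown value of a given pesticide’s ADI could be stated as:

$$ \begin{array}{l} \neg \,adi\left( X \right) \leftarrow not\;adi\left( X \right),\;not\;exception_{adi} \left( X \right) \\ exception_{adi} \left( X \right) \leftarrow adi\left( \bot \right) \end{array} $$

where the first clause closes the predicate adi under negation-by-failure, and the second one raises an exception whenever its value is unknown (denoted by \( \bot \)), so that neither \( adi(X) \) nor \( \neg adi(X) \) can be proven.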

In order to evaluate the knowledge that may be associated to a logic program, an assessment of the Quality-of-Information (QoI), given by a truth-value in the interval [0, 1] that stems from the extensions of the predicates that make a program, even in dynamic environments, is set [15, 16]. On the other hand, a measure of one’s confidence that the argument values or attributes of the terms that make the extension of a given predicate, with relation to their domains, fit into a given interval, is also considered, and labeled as Degree of Confidence (DoC) [17]. The DoC is evaluated as described in [17] and computed as \( DoC = \sqrt {1 - \Delta l^{2} } \), where \( \Delta l \) stands for the length of the argument interval, once normalized to [0, 1]. Thus, the universe of discourse is engendered according to the information presented in the extensions of such predicates, according to productions of the type:

$$ predicate_{i} - \bigcup\limits_{1 \le j \le m} {clause_{j} \left( {\left( {QoI_{{x_{1} }} ,DoC_{{x_{1} }} } \right), \cdots ,\left( {QoI_{{x_{m} }} ,DoC_{{x_{m} }} } \right)} \right)::QoI_{i} ::DoC_{i} } $$
(1)

where \( \cup \) and \( m \) stand, respectively, for set union and the cardinality of the extension of \( predicate_{i} \), while \( QoI_{i} \) and \( DoC_{i} \) stand for themselves.
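As a minimal illustration of the DoC metric just defined (a sketch of ours, not tied to any particular implementation), the computation for a normalized attribute interval reads:

```python
from math import sqrt

def doc(lower: float, upper: float) -> float:
    """Degree of Confidence for an attribute whose normalized value
    lies in [lower, upper], a sub-interval of [0, 1]:
    DoC = sqrt(1 - dl^2), with dl the interval length."""
    dl = upper - lower
    return sqrt(1.0 - dl ** 2)

# A point value is fully confident; an unknown one is not at all:
print(doc(0.28, 0.28))  # 1.0    (exact value, dl = 0)
print(doc(0.64, 0.81))  # ~0.985 (dl = 0.17)
print(doc(0.0, 1.0))    # 0.0    (unknown value, dl = 1)
```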

3 Case Study

In order to develop a predictive model to assess pesticide toxicological risk, a knowledge base was set up, built around the pesticide records of the National Pesticide Information Center [18]. For each pesticide, information regarding environmental fate, human exposure and toxicity (i.e., acute and chronic) was considered, in both qualitative and quantitative terms. This section demonstrates how the information comes together and how it is processed.

3.1 Qualitative Data Pre-processing

Aiming at the quantification of the qualitative information, and in order to make the process easier to understand, it was decided to present it in graphical form. Taking as an example a set of 3 (three) issues regarding a particular subject (where the possible alternatives are none, low, moderate, high and very high), a circle of unitary radius split into 3 (three) slices is considered (Fig. 1). The marks on the axis correspond to each of the possible choices. If the answer to issue 1 is high, the corresponding area is \( \pi \times 0.75^{2} /3 \), i.e., \( 0.19 \pi \) (Fig. 1(a)). Assuming that for issue 2 the alternatives high and very high are chosen, the corresponding area ranges in the interval \( \pi \times 0.75^{2} /3 \cdots \pi \times 1^{2} /3 \), i.e., \( 0.19 \pi \cdots 0.33\pi \) (Fig. 1(b)). Finally, if no alternative is ticked for issue 3, all the hypotheses should be considered, and the area varies in the interval \( 0 \cdots \pi \times 1^{2} /3 \), i.e., \( 0 \cdots 0.33\pi \) (Fig. 1(c)). The total area is the sum of the partial ones and falls in the interval \( 0.38 \pi \cdots 0.85\pi \) (Fig. 1(d)). The normalized area is the ratio between the area of the figure and the area of the circle of unitary radius. Thus, the quantitative value regarding the subject under analysis is set to the interval \( 0.38 \cdots 0.85 \), as illustrated by the sketch following Fig. 1.

Fig. 1. A view of the qualitative evaluation process.
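This quantification procedure is easy to mechanize. The following sketch (ours, with the five-level scale taken from the text) reproduces the interval \( 0.38 \cdots 0.85 \) of the worked example:

```python
# Radii assigned to the qualitative alternatives (the marks on the axis).
SCALE = {"none": 0.0, "low": 0.25, "moderate": 0.5,
         "high": 0.75, "very high": 1.0}

def subject_interval(answers, n_issues):
    """Each issue contributes a slice of a unit circle (area pi * r^2 / n).
    'answers' maps an issue index to the list of ticked alternatives;
    an empty list means all hypotheses must be considered.
    Returns the normalized interval (the division by pi cancels out)."""
    lower = upper = 0.0
    for issue in range(n_issues):
        ticked = answers.get(issue, []) or list(SCALE)  # no tick -> all
        radii = [SCALE[alt] for alt in ticked]
        lower += min(radii) ** 2 / n_issues
        upper += max(radii) ** 2 / n_issues
    return lower, upper

# Worked example from the text: high; high or very high; nothing ticked.
print(subject_interval({0: ["high"], 1: ["high", "very high"], 2: []}, 3))
# -> (0.375, 0.8541...), i.e., the interval 0.38 ... 0.85
```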

3.2 A Logic Programming Approach to Data Processing

It is now possible to build up the knowledge base, given in terms of the extensions of the relations (or tables) depicted in Fig. 2, which denote a situation where one has to manage information in order to evaluate the Pesticide Toxicological Risk. Under this scenario some incomplete and/or default data is present. For instance, in one case the ADI is unknown (depicted by the symbol \( \bot \)), while in another the Acute Toxicity for Mice/Rats is not conclusive (Slightly/Moderate).

Fig. 2. A fragment of the knowledge base for Toxicological Risk Assessment.

The Human Exposure table is populated with 0 (zero), which stands for absence; 1 (one), which denotes food or drinking water only (in the dietary column) and dermal or inhalation exposure only (in the occupational column); and 2 (two), which stands for simultaneous exposure. The issues presented in the Environmental Fate table are populated with absence, low, medium, high and very high, while the columns of the Acute and Chronic Toxicity tables are filled with absence, slightly, medium, high and very high. In order to quantify the information present in these tables, the procedures already described above were followed.

Applying the algorithm presented in [17] to the fields of the tables or relations that make the knowledge base for Pesticide Toxicological Risk Assessment (Fig. 2), and looking at the DoC values obtained as described in [17], it is possible to set the arguments of the predicate toxicological risk assessment (tra) referred to below, whose extensions denote the objective function with respect to the problem under analysis:

$$ \begin{aligned} & tra:A_{cceptable} D_{aily} I_{ntake} , M_{aximum} C_{ontamination} L_{evel} , E_{nvironmental} F_{ate} , A_{cute} \\ & \quad \quad \quad T_{oxicity} , C_{hronic} T_{oxicity} , D_{ietary} H_{uman} E_{xposure} , O_{ccupational} E_{xposure} \to \left\{ {0,1} \right\} \\ \end{aligned} $$

where 0 (zero) and 1 (one) denote, respectively, the truth values false and true.

The algorithm presented in [17] encompasses different phases. In the first one, the clauses or terms that make the extension of the predicate under study are established. In the subsequent stage, the arguments of each clause are set as continuous intervals. In a third step, the boundaries of the attribute intervals are normalized to [0, 1] according to the expression \( (Y - Y_{min} )/(Y_{max} - Y_{min} ) \), where the \( Y \)s stand for themselves. Finally, the DoC is evaluated as described in Sect. 2.

Exemplifying the application of the algorithm presented in [17] to the term (clause) that presents the feature vector ADI = 0.01, MCL = \( \bot \), EF = 0.28, AT = [0.64, 0.81], CT = [0.04, 0.06], DHE = 1, OE = 2, one may have:
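For instance, once the arguments are normalized (with DHE and OE taken on an assumed 0–2 scale, so that DHE = 1 maps to 0.5 and OE = 2 to 1), the DoC of each argument follows from \( DoC = \sqrt {1 - \Delta l^{2} } \):

$$ \begin{array}{l} ADI = 0.01,\;EF = 0.28,\;DHE = 0.5,\;OE = 1\; \Rightarrow \;\Delta l = 0\; \Rightarrow \;DoC = 1 \\ MCL = \bot \; \Rightarrow \;\left[ {0,\,1} \right]\; \Rightarrow \;\Delta l = 1\; \Rightarrow \;DoC_{MCL} = 0 \\ AT = \left[ {0.64,\,0.81} \right]\; \Rightarrow \;\Delta l = 0.17\; \Rightarrow \;DoC_{AT} = \sqrt {1 - 0.17^{2} } \approx 0.99 \\ CT = \left[ {0.04,\,0.06} \right]\; \Rightarrow \;\Delta l = 0.02\; \Rightarrow \;DoC_{CT} \approx 1 \end{array} $$

i.e., pointwise arguments carry full confidence, the unknown MCL carries none, and interval-valued arguments fall in between.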

4 Artificial Neural Networks

On the one hand, ANNs denote a set of connectionist models inspired by the behaviour of the human brain. In particular, the MultiLayer Perceptron (MLP) stands for the most popular ANN architecture, where neurons are grouped in layers and only forward connections are set [19]. This provides a powerful base-learner with some advantages with respect to other approaches (e.g., adaptability, robustness, flexibility, nonlinear mapping and noise tolerance), a reason why ANNs are increasingly used in data mining, namely due to their good behaviour in terms of predictive knowledge [20]. The interest in MLPs was stimulated by the advent of the Backpropagation algorithm in 1986, and since then several fast gradient-based variants have been proposed (e.g., RPROP) [21]. Yet, these training algorithms minimize an error function by tuning the modifiable parameters of a fixed architecture, which needs to be set a priori. The MLP performance is sensitive to this choice, i.e., a small network will provide limited learning capabilities, while a large one will induce generalization loss (i.e., overfitting).

An MLP is molded on three or more layers of artificial neurons, including an input layer, an output layer and a number of hidden layers with a certain number of active neurons. In addition, there is also a bias, which is only connected to neurons in the hidden and output layers [19]. The number of nodes in the input layer is set by the number of independent variables, while the number of nodes in the output layer matches the number of dependent ones [19]. The correct design of the MLP topology is a complex and crucial task, commonly addressed by trial-and-error procedures (e.g., exploring different numbers of hidden nodes) in a blind search strategy, which only goes through a small set of possible configurations. More elaborate methods have also been proposed, such as pruning [22] and constructive [23] algorithms, although these perform hill-climbing and are thus prone to local minima [11].

On the other hand, the framework presented previously shows how the information comes together and how it is processed. In this section, a data mining approach to deal with the processed information is considered. A hybrid computing approach was set up to model the universe of discourse, whose computational part is based on the ANNs described above, used not only to structure data but also to capture the nature of the problem’s objective function (i.e., the relationships between inputs and outputs) [24, 25].

Figure 3 shows a case being submitted to the Pesticide Toxicological Risk Assessment model. The normalized values of the interval boundaries and their QoIs and DoCs stand for the inputs to the ANN. The output is given in terms of the Pesticide Toxicological Risk evaluation and the degree of confidence that one has on such a happening. In this study 142 pesticides were considered (i.e., one hundred and forty-two terms or clauses of the extension of predicate tra). To implement the evaluation mechanisms and to test the model, ten-fold cross-validation was applied [19]. The backpropagation algorithm was used in the learning process of the MLP. The identity function was used as the output function of the pre-processing layer, while the sigmoid function was considered in the remaining layers.

Fig. 3. The ANN topology.
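For readers wishing to reproduce this setup, the following minimal sketch (ours, not the authors’ implementation) trains an MLP by backpropagation under ten-fold cross-validation; it assumes scikit-learn, uses synthetic placeholder data in place of the pesticide knowledge base, and lets logistic activations stand in for the sigmoid units referred to above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)

# Placeholder feature matrix: for each of the 142 cases, the normalized
# interval boundaries plus the QoI and DoC of each of the 7 attributes.
X = rng.random((142, 7 * 4))
y = rng.integers(0, 2, size=142)  # 0/1 toxicological-risk label

# MLP trained by gradient descent on the backpropagated error,
# with sigmoid (logistic) hidden units.
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1,
                    max_iter=2000, random_state=0)

# Ten-fold cross-validation, as in the text.
scores = cross_val_score(mlp, X, y, cv=StratifiedKFold(n_splits=10))
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```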

A common tool to evaluate the results presented by classification models is the coincidence matrix, a matrix of size L × L, where L denotes the number of possible classes, created by matching the predicted and target values. In the present case L was set to 2 (two). Table 1 presents the coincidence matrix of the ANN model, where the values shown denote the average of 25 (twenty-five) experiments. A glance at Table 1 shows that the model accuracy was 93.7 % (133 correctly classified instances out of 142). Therefore, the predictions made by the ANN model are satisfactory, attaining accuracies higher than 90 %.

Table 1. The coincidence matrix for the ANN model.

Based on the coincidence matrix it is possible to compute the sensitivity, specificity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) of the classifier. Briefly, sensitivity evaluates the proportion of actual positives that are correctly identified as such, while specificity translates the proportion of actual negatives that are correctly identified. PPV stands for the proportion of cases with positive results that are correctly classified, while NPV is the proportion of cases with negative results that are successfully labeled. The values obtained for sensitivity, specificity, PPV and NPV were 94.7 %, 91.5 %, 95.7 % and 89.6 %, respectively. On the one hand, the proposed model correctly identified 94.7 % of the positive cases, i.e., pesticides with potential toxicological risk. On the other hand, it also classified appropriately 91.5 % of the negative cases, i.e., pesticides with low toxicological risk.
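These figures follow directly from the matrix counts. The sketch below (ours) uses counts that are consistent with the reported percentages (TP = 90, FN = 5, TN = 43, FP = 4, a hypothetical reconstruction, since Table 1 reports averages over 25 experiments):

```python
# Counts consistent with the reported figures (hypothetical reconstruction):
TP, FN, TN, FP = 90, 5, 43, 4  # 142 cases, 133 correctly classified

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 0.937
sensitivity = TP / (TP + FN)                   # 0.947
specificity = TN / (TN + FP)                   # 0.915
ppv         = TP / (TP + FP)                   # 0.957
npv         = TN / (TN + FN)                   # 0.896

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("PPV", ppv), ("NPV", npv)]:
    print(f"{name}: {value:.1%}")
```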

The present model, beyond considering the pesticide chemical properties, enables the integration of Acute and Chronic Toxicity data with other factors such as Environmental Fate and Human Exposure, and is therefore assertive in the prediction of Pesticide Toxicological Risk. Thus, it is our claim that the proposed model is able to evaluate the Toxicological Risk properly and can be a major contribution to achieving high levels of public health protection and environmental sustainability.

The LP approach to data processing presented in this work is a generic one, and may therefore be applied in different domains. Indeed, some interesting results have been obtained, namely in the fields of Education [26, 27], the evaluation of pharmacological properties of Essential Oils [28, 29], and Health [30, 31].

5 Conclusions

The proposed approach is able to give an adequate response to the need to predict the toxicological risk of pesticide exposure. This is a hard task, since it is necessary to consider different variables and/or conditions, with complex relations entwined among them, where the data may be incomplete, self-contradictory, and even unknown. In order to overcome these difficulties, this work presents the foundations of a hybrid computing approach that uses a powerful knowledge representation and reasoning mechanism to set the structure of the information, complemented with a computational framework based on ANNs, which were selected due to their dynamics, like adaptability, robustness, and flexibility. This approach not only allows the evaluation of the pesticide toxicological risk, but also permits the estimation of the degree of confidence that one has on such a happening. In fact, this is one of the added values of this approach, arising from the complementarity between Logic Programming (for knowledge representation and reasoning) and the computing process based on ANNs. The present model is a generic one, susceptible of application in different arenas. A possible limitation on its use lies not in the model itself, but in the unavailability of data, information or knowledge; yet, even in these situations, once it has the capacity to handle incomplete data, information or knowledge, either in qualitative or quantitative form, its usefulness is assured. Future developments of the model should include the biotransformation pathways and routes of exposure, and consider the contact time and the individual’s susceptibility. Furthermore, this problem might be approached using other computational frameworks like Case Based Reasoning [27], Genetic Programming [14], or Particle Swarm Optimization [32], just to name a few.