
1 Introduction

Freshwater is considered one of the planet's most important renewable natural resources. In this sense, it is vital to study and evaluate water quality in lotic ecosystems, that is, aquatic ecosystems in constant motion in a single direction, such as rivers and basins [1]. A problem occurs in the Rio Piedras Basin, which is located on the western slope of the Central Cordillera, west of Popayán (Colombia), and is the main water supply source for 9 rural communities: Huacas, Laureles, Canelo, Quintana, San Juan, Santa Teresa, Laguna, San Ignacio, and San Isidrio [2, 3]. Nevertheless, these communities do not carry out water quality control.

A significant amount of research has been conducted to develop water quality detection systems that support water quality monitoring activities in different basins [4–7] and lotic ecosystems [8–10] around the world [5, 11–17], through supervised learning (SL) algorithms. SL tasks predict or classify new input data from examples (instances), commonly called training data (composed of attributes and a target variable), through algorithms such as Decision Trees (DT), Bayesian Networks (BN), Artificial Neural Networks (ANN), K-Nearest Neighbor (K-NN), and Support Vector Machines (SVM) [18]. However, these research approaches set aside data quality verification (i.e., redundant attributes, duplicate instances, imbalanced datasets, etc.).

Therefore, this paper proposes an improvement of the data processing techniques for a water quality detection system for the Rio Piedras Basin, based on supervised learning techniques, which considers the issues found in a data quality verification phase. The remainder of this paper is organized as follows: Sect. 2 describes the study area, the data quality issues addressed, and the supervised learning algorithms used; Sect. 3 presents the data processing proposed for the water quality detection system; Sect. 4 presents results and discussion; and Sect. 5 presents conclusions and future work.

2 Background

2.1 Study Area Description

The data used in this study were collected quarterly in the Rio Piedras Basin, located in the Cauca department, Colombia (source: 76° 31′ 10″ west longitude and 2° 21′ 45″ north latitude; river mouth: 76° 23′ 45″ west longitude and 2° 25′ 40″ north latitude), by the Environmental Studies Group (ESG) of the University of Cauca, between 2011 and 2013, following the methodology described in [19]. The captured samples contain biological (macroinvertebrate) and physicochemical variables, taken at three points of the basin (Puente Alto, Puente Carretera, and Bocatoma Diviso) in different precipitation periods: high (October-November), average (June-July), and low (August-September).

Thus, 10 physicochemical indicators, 5 biological indicators, and 3 precipitation periods were captured. In total, the resulting dataset consists of 645 records and 3 values to classify (classes) [17], as set forth in Table 1.

Table 1. Dataset Rio Las Piedras attributes

According to the latter work, the three (3) values to classify are denoted by the numbers 1, 2, and 3, which represent high water quality (very clean water), good quality (slightly polluted water), and regular quality (moderately polluted water), respectively.

2.2 Dimensionality Reduction

Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of smaller dimensionality. This reduced representation must have the minimum number of parameters required to express the observed data properties [20, 21]. This task is primarily oriented towards two objectives: instance reduction and attribute reduction techniques.

Attributes Reduction (AR).

Attribute reduction decreases the dimensionality of the attributes within a dataset [21–24]. AR methods are grouped into two categories: attribute selection and attribute extraction. The first looks for the best subset of features according to certain criteria (choice of attributes and/or number of attributes to be selected), discarding redundant, inconsistent, and irrelevant attributes, while the second transforms the high-dimensional attribute set into a space of smaller dimension [25, 26].

The importance of AR mechanisms in optimizing a dataset is noteworthy. However, applying attribute selection techniques generates information loss [22], which is a problem for small datasets such as the Rio Las Piedras dataset. Therefore, this paper opted for attribute extraction techniques.

At the same time, a systematic review of 44 studies published from 2004 onward was conducted, based on the guidelines set forth in [27], taking IEEE Xplore (35 items) and ScienceDirect (9 items) as search sources and focusing on application domains such as intrusion detection, medicine, biometrics, facial recognition, and satellite image classification, among others. It was found that Principal Component Analysis (PCA) is the most used algorithm (33 items) and that it also surpasses other extraction techniques in its capacity for information visualization and understanding. Therefore, this algorithm is taken as the starting point for reducing attributes in the water quality evaluation task. The explanation of the chosen techniques is presented in Sect. 3.

Instances Reduction.

Instance Reduction (IR) decreases the number of irrelevant instances within a dataset [28–30]. Several authors propose classifications of IR methods, as is the case of [31], which groups IR techniques into Noise Filters, Condensation Algorithms, and Prototype Algorithms, while in [32] they are classified into Wrapper and Filter. Furthermore, in [33–35] ensemble methods such as Cascading, Stacking, Bagging, Boosting, and Random Forest are used for IR tasks.

To select the appropriate algorithm for instance reduction, 34 studies published from 2006 onward were reviewed, using IEEE Xplore (12 items), ScienceDirect (10 items), Springer Link (6 items), and Google Scholar (6 items) as search sources, and focusing on application areas such as intrusion detection, security, classifier building, time series, and text recognition, among others. This review found that the most commonly used techniques are Ensemble and Wrapper, with 16 and 12 papers respectively, while Filter methods are referenced only 6 times.

It is important to indicate that Wrapper algorithms tend to be over-trained (overfitting) due to the frequent use of cross-validation as the evaluation technique on a single dataset, which tends to fit very specific features of the training data that have no causal relationship with the objective function [36]. Moreover, among the Ensemble methods, Boosting is the IR technique most often used [34] and it has a greater data generalization capacity. For the above reasons, this research takes the Boosting algorithm as its starting point for IR tasks.

2.3 Imbalanced Classes

The class imbalance problem occurs when the number of instances with one class label (the majority or negative class, "C−") is greater than the number of instances with the other class label (the minority or positive class, "C+") [37, 38]; the degree of imbalance can be measured using the imbalance level (IL) ratio [39–41].

In this scenario, classifiers tend to classify instances toward the majority class, thereby minimizing the classification error and correctly classifying majority class instances to the detriment of minority class instances.

To solve the class imbalance problem, two main approaches have been proposed [38, 42, 43]: the external method (at the data level) and the internal method (at the classification algorithm level). The first achieves a balance between classes by eliminating instances of the majority class (sub-sampling) or including instances in the minority class (over-sampling), while the internal method adjusts the classifiers to favor the minority class. In this comparison, the external method is the most versatile (because it does not require any change in the algorithm and can be used in different application domains) and the most widely used in the last decade; its most representative algorithm is SMOTE (Synthetic Minority Over-Sampling Technique), which adds examples to the minority class by creating new instances (called synthetic data) obtained from an interpolation process.

2.4 Classifiers for Water Quality Detection

For classifier selection, 4 studies were taken as a starting point [18, 44–46], which perform a literature review and theoretically evaluate the most commonly used supervised learning algorithms, namely Decision Trees (DT), Artificial Neural Networks (ANN), Bayesian Networks (BN), K-Nearest Neighbor (K-NN), and Support Vector Machines (SVM), considering metrics such as accuracy, noise tolerance, explanation capacity, learning speed, and classification speed.

3 Mechanism for Detecting Water Quality in Rio Piedras Basin

The water quality detection system for the Rio Piedras Basin contains various components that perform data pre-processing and run supervised learning algorithms based on the data mining tool KNIME® Analytics Platform, all transparent to the end user. Fig. 1 presents the architecture of the proposed mechanism, which is made up of data processing and classification modules. The first module consists of two components for dimensionality reduction (attributes and instances), to address the redundant values problem, and a component that balances the classes. The second module consists of the supervised learning algorithms DT, ANN, BN, K-NN, and SVM.

Fig. 1. Mechanism architecture

Below, the proposed mechanism components are briefly described:

3.1 Automatic Selection of Principal Components (ASPC)

Automatic Selection of Principal Components (ASPC) is based on the definition of the PCA algorithm and proposes a new approach to Principal Component (PC) selection.

First, PCA is a multivariate technique from exploratory (statistical) data analysis that transforms a large number of correlated attributes into a smaller number of uncorrelated attributes (linearly independent or orthogonal), called principal components (PC). The principal components are linear combinations of the original attributes, which describe the greatest amount of dataset information and are sorted by the amount of information they contain (high to low) [47–49].

Currently, there is no definite rule on the exact number of PCs to be used. However, to select the optimal PC subset, the authors in [50] present various techniques such as B1-Backward, B1-Forward, B2, and B4, obtaining better results with the B4 method. B4 uses the PCs whose accumulated explained variance exceeds 60 % of the information.
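For illustration, the B4 criterion can be sketched in Python with scikit-learn as follows; the function name, the 60 % threshold argument, and the assumption of a purely numeric feature matrix are ours, not part of [50]:

```python
# Minimal sketch of the B4 criterion: keep the smallest number of principal
# components whose accumulated explained variance first exceeds 60 %.
# Assumes X is a numeric matrix (n samples x p attributes); names are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def b4_num_components(X, threshold=0.60):
    X_std = StandardScaler().fit_transform(X)            # PCA is sensitive to attribute scale
    cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    return int(np.argmax(cumulative >= threshold)) + 1   # first PC count reaching the threshold
```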

On the other hand, ASPC is the mechanism that consists of building data subsets $X'_{n \times p'}$, for $p' = 1, 2, \ldots, p$, where $p$ equals the number of eigenvalues (EV) and attributes that PCA determines within the dataset; each subset has $p'$ variables and $n$ observations and is constructed from the eigenvalues and their eigenvectors.

From here, data subsets are constructed from the accumulated PCs $\{1\}, \{1+2\}, \{1+2+3\}, \ldots, \{1+2+3+\cdots+m\}$. Subsequently, each data subset is evaluated by means of the classification accuracy obtained with several supervised learning techniques: SVM, BN, K-NN, DT, and ANN. Finally, the data subset with the best accuracy is selected, and with it the associated number of PCs.

It is worth noting that, for the principal component analysis, the relationship between the PCs and the initial attributes in the feature matrix should be examined, taking into account the sign and magnitude of the correlations.
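To make the ASPC procedure concrete, a minimal Python sketch is given below. It assumes numeric data, uses scikit-learn classifiers as stand-ins for the SVM, BN, K-NN, DT, and ANN models (Gaussian naive Bayes replaces a full Bayesian network), and evaluates each accumulated-PC subset with 10-fold cross-validation; it is not the exact KNIME implementation used in this work:

```python
# Sketch of ASPC: evaluate the accumulated-PC subsets {1}, {1+2}, ..., {1+...+p}
# with several classifiers and keep the subset with the best average accuracy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def aspc(X, y, cv=10):
    Z = PCA().fit_transform(StandardScaler().fit_transform(X))   # PCs ordered by variance
    models = [SVC(), GaussianNB(), KNeighborsClassifier(),
              DecisionTreeClassifier(), MLPClassifier(max_iter=1000)]
    best_k, best_acc = 1, -1.0
    for k in range(1, Z.shape[1] + 1):                    # accumulated subsets of PCs
        acc = np.mean([cross_val_score(m, Z[:, :k], y, cv=cv).mean() for m in models])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc                               # chosen number of PCs and its accuracy
```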

3.2 Instances Reduction (IR-E)

The mechanism defined for Instance Reduction (IR-E) takes as its starting point the model proposed in [34], called BIS (Boosting Instance Selection), which builds on the Boosting classifier definition to create a model that selects redundant and irrelevant instances within a dataset. This algorithm replaces the classification models with IR algorithms such as Decremental Reduction Optimization Procedure 3 (Drop3), IB3, Iterative Case Filtering (ICF), Modified Selective Subset (MSS), Reduced Nearest Neighbor (RNN), and the Condensed Nearest Neighbor Rule (CNN).

BIS performs a process M times (M is defined a priori) and, in each iteration, a voting process is executed, which consists of assigning one vote ($v_i$) to each instance ($x_i$) selected by a particular IR technique chosen at random. After M iterations, a vote vector $V$ is obtained, which records the votes received by each instance and, from the latter, a set of thresholds $(\theta_1, \theta_2, \ldots, \theta_M)$ is built from the instances that obtained the most votes. Now, to obtain the best threshold, a criterion $J(\theta)$ is defined, represented by an instance subset $S_\theta$ belonging to the training set $T$ that satisfies Eq. 1.

$$ S_{\theta} = \{ x_{i} \in T : v_{i} > \theta \} $$
(1)

To assess each criterion $J(\theta_i)$, the K-NN classifier is trained with the subset $S_{\theta_i}$, where the quality of each instance subset is scored according to both the classification performance and the number of removed instances, as shown in Eq. 2. In fact, any classifier could be used (DT, SVM, BN, or ANN); nevertheless, for this research we restrict ourselves to K-NN due to its simplicity, ability to detect wrong results, and high learning speed [18, 44–46].

$$ J(\theta) = \alpha C + (1 - \alpha) r $$
(2)

where $C$ denotes the classification performance (precision or AUC), $r$ denotes the percentage of deleted instances, and $\alpha$ is the parameter that weights the importance of each factor.

On the other hand, each threshold assessment involves training a K-NN classifier with the instance subset that represents it, which implies a high assessment complexity ($2M + 2$) when the number of iterations $M$ is high (the maximum number of possible thresholds equals $M$).

To address this problem, an optimal threshold, called $\theta_o$, is selected through an approach that works on two levels. The first level consists of constructing a subset of thresholds $(\theta_1, \theta_2, \ldots, \theta_p)$ from the set of possible thresholds $(\theta_1, \theta_2, \ldots, \theta_M)$, with $p \le M$, discarding repeated or equivalent thresholds, since these yield similar assessments in terms of classification performance. In this way, a complexity of at most $2p + 2$ is achieved.

The second level seeks a threshold value that represents both high and low voting values, thus counteracting the drawbacks mentioned above. For this, the arithmetic mean of the threshold subset resulting from the previous step is computed, as expressed in Eq. 3.

$$ \theta_{o} = \frac{1}{p}\sum\limits_{i = 1}^{p} \theta_{i} $$
(3)
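A simplified Python sketch of this voting and threshold-selection scheme is shown below. The pool of IR techniques is only illustrative: imbalanced-learn's EditedNearestNeighbours and CondensedNearestNeighbour stand in for Drop3, ICF, MSS, RNN, and CNN (for which no standard scikit-learn implementation exists), each round works on a bootstrap of the data to introduce the variability that Boosting-style resampling provides, and $\alpha$, $M$, and all names are assumptions rather than the exact BIS configuration:

```python
# Sketch of the BIS-style voting and threshold selection behind IR-E (Eqs. 1-3).
# X and y are NumPy arrays; the IR techniques below are illustrative stand-ins.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import EditedNearestNeighbours, CondensedNearestNeighbour

def ire_select(X, y, M=20, alpha=0.75, seed=0):
    rng = np.random.default_rng(seed)
    pool = [EditedNearestNeighbours(), CondensedNearestNeighbour(random_state=seed)]
    votes = np.zeros(len(X), dtype=int)
    for _ in range(M):                                    # M voting rounds
        boot = rng.choice(len(X), size=len(X), replace=True)
        sampler = pool[rng.integers(len(pool))]           # randomly chosen IR technique
        sampler.fit_resample(X[boot], y[boot])
        votes[np.unique(boot[sampler.sample_indices_])] += 1   # one vote per kept instance
    thresholds = np.unique(votes)                         # drop repeated/equivalent thresholds
    theta_o = thresholds.mean()                           # Eq. 3: mean of the p thresholds
    selected = np.flatnonzero(votes > theta_o)            # Eq. 1: S = {x_i : v_i > theta_o}
    C = cross_val_score(KNeighborsClassifier(), X[selected], y[selected], cv=5).mean()
    r = 1.0 - len(selected) / len(X)                      # fraction of removed instances
    return selected, alpha * C + (1 - alpha) * r          # Eq. 2: J(theta)
```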

3.3 Automatic Optimal Synthetic Data Selection (AOSDS)

AOSDS is based on the definition of the over-sampling algorithm SMOTE (Synthetic Minority Over-Sampling Technique) and proposes a strategy to generate an appropriate amount of synthetic data in a way that minimizes classifier over-training.

SMOTE currently works with binary classes, a majority and a minority class, denoted C− and C+ respectively. This algorithm creates synthetic instances (data) for the minority class by interpolating between an instance and its K closest neighbors belonging to that class. First, the K closest neighbors from the minority class are selected and, subsequently, the synthetic instance percentage (denoted by P) is chosen. Then, to generate a new synthetic data point, interpolation is performed along the line connecting each minority class instance with any (or all) of its previously selected K closest neighbors. This calculation uses the Euclidean distance definition and a replacement or overlay function (called overlap) that assigns a value of 0 (if both values are equal) or 1 (if they are different). Since SMOTE applies only to binary classes, this work focused the imbalance problem of the Rio Las Piedras dataset on two classes, Class 1 and Class 3 (C− and C+ respectively), ignoring the remaining class (Class 2). However, it is important to mention that Class 2 is ignored only in the oversampling process.
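For numeric attributes, the interpolation step just described can be sketched as follows; the overlap function for nominal attributes is omitted, and the function name and parameters are illustrative:

```python
# Minimal sketch of SMOTE's interpolation for numeric attributes: each synthetic
# point lies on the segment between a minority instance and one of its K nearest
# minority-class neighbours (the nominal-attribute overlap function is omitted).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_numeric(X_min, P=150, K=2, seed=0):
    rng = np.random.default_rng(seed)
    n_new = int(len(X_min) * P / 100)                     # P % of the minority class size
    _, idx = NearestNeighbors(n_neighbors=K + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # a minority instance
        j = idx[i][rng.integers(1, K + 1)]                # one of its K nearest neighbours
        gap = rng.random()                                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```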

Although this technique generates new instances that balance the classes, no method has been found so far to indicate the optimal number of instances to create, since a large number of synthetic instances can over-train the classifier and produce inaccurate results [42, 51, 52].

The proposed strategy involves first verifying that the original dataset's imbalance level (IL) meets a minimum class imbalance threshold. If the dataset meets this threshold, SMOTE is applied to the minority class. For this case, an imbalance level $IL \ge 3.9$ is defined and oversampling percentages (P values) of 50, 100, 150, and 200 % are used, thereby preventing the number of C+ instances from exceeding that of C−. Furthermore, [53] indicates that achieving 200 % oversampling requires only two nearest neighbors, which is enough for this study case. Therefore, K = 2 is taken as the number of nearest neighbors for SMOTE.

Once the over-sampling process is applied for a given P, the behavior of each generated subset is evaluated and the data subset with the best performance is selected. At the same time, the optimal synthetic data percentage associated with that dataset is defined.
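A sketch of this selection loop, using the SMOTE implementation from imbalanced-learn, is shown below; the class labels follow the paper (1 for C− and 3 for C+), while the decision tree used for scoring and the macro F-measure criterion are illustrative choices rather than the exact setup of this work:

```python
# Sketch of AOSDS: check the imbalance level (IL), oversample C+ at several P
# values with SMOTE (K = 2 neighbours), evaluate each resampled set with 10-fold
# cross-validation, and keep the best-performing P. Scoring choice is illustrative.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def aosds(X, y, neg=1, pos=3, min_il=3.9, p_values=(50, 100, 150, 200)):
    counts = Counter(y)
    if counts[neg] / counts[pos] < min_il:                # IL below the defined threshold
        return None                                       # dataset considered balanced enough
    best_p, best_score = None, -1.0
    for P in p_values:
        target = {pos: int(counts[pos] * (1 + P / 100))}  # only C+ is oversampled
        Xr, yr = SMOTE(sampling_strategy=target, k_neighbors=2,
                       random_state=0).fit_resample(X, y)
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                Xr, yr, cv=10, scoring='f1_macro').mean()
        if score > best_score:
            best_p, best_score = P, score
    return best_p, best_score                             # optimal oversampling percentage
```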

3.4 Classifiers to Water Quality Detection

This component consists of the classifiers used to perform water quality detection in the Rio Piedras Basin, which were selected based on the research presented in [18, 44–46], where several classifiers (DT, ANN, BN, K-NN, and SVM) are theoretically evaluated, considering metrics such as precision, noise tolerance, explanatory capacity, learning speed, and classification speed. These studies claim that no single algorithm satisfies all evaluation metrics. Also, depending on the dataset used, each algorithm behaves differently (in model evaluation metrics such as precision). Therefore, it is necessary to evaluate each of these algorithms in order to obtain a high degree of precision in predictions and easy interpretation. Additionally, it is important to mention that, to evaluate the proposed mechanism, cross-validation with k = 10 was used.

4 Experimental Results

The ASPC, IR-E, and AOSDS methods were evaluated individually and jointly, using the classifiers SVM, BN, K-NN, C.4.5 (DT), and ANN applied to the original and processed datasets. The experimental process can be seen in more detail in Fig. 2.

Fig. 2. Experimental process for dimension reduction

4.1 Attributes Reduction

The results of the PCA process on the Rio Las Piedras dataset are shown in Fig. 3. As shown in Fig. 3(a), 18 principal components (PC) are generated, where the first PC represents 21.2 % of the total information, the second PC explains 15.2 % of the original variance, the third explains 11.6 %, and so on until PC 18, which represents 0.1 % of the total information. As discussed above, the explained variance data are important to determine the number of PCs to be used in the analysis. If we take the B4 methodology as the criterion for component selection, the optimal number of PCs (reducing the number of PCs as much as possible) is 5 components, whose accumulated variance is 65.2 %. This means that the first five (5) PCs represent 65.2 % of the total dataset information. As shown in Fig. 3(b), the remaining components explain significantly lower percentages than the first five components.

Fig. 3. PCA results on Rio Las Piedras dataset

Once the PCA process is applied to the dataset, the proposed ASPC mechanism is applied, whose first step is to build datasets with the accumulated $\{1\}, \{1+2\}, \{1+2+3\}, \ldots, \{1+2+3+\cdots+18\}$ principal components. Subsequently, the classifiers mentioned above are trained with each of these datasets using 10-fold cross-validation. Finally, we obtain the data subset with the best precision and the number of PCs associated with it. These results are shown in Table 2.

Table 2. ASPC results

As seen, the ASPC mechanism selects the dataset consisting of the first 5 PCs as the best dataset, because it had the best average precision (89.9 %) among all the datasets formed. The reduction of 13 components achieved by the ASPC method improved classification performance by approximately 6 %.

In practical terms, the problem initially represented in an 18-dimensional space has been reduced to a smaller hyperspace (5 dimensions), capturing 65.2 % of the original variance. This result implies greater and easier data interpretability, a process that, as mentioned above, is performed by analyzing the relationship between the PCs and the initial attributes.

From these results, it can be assumed that the ASPC method can perform a considerably strong selection of training data without deteriorating the classifiers' capacity.

Now, to compare the ASPC method with the B4 selection method, the results obtained are reported in Table 3.

Table 3. Comparison between ASPC and B4 methods

The ASPC mechanism applied to the Rio Las Piedras dataset was able to reduce the space to five components. These results correspond with those obtained with the B4 method, which suggests that the new dataset should be constituted by five components, whose information amount lies within the accumulated variance confidence interval of 60 %–95 %. Moreover, the mechanism increased both the classification performance and the model interpretability (fewer attributes).

4.2 Instance Reduction

By applying the IR-E method proposed for instance reduction on a water quality dataset, the results presented in Table 4 were obtained.

Table 4. IR-E results

As noted, the IR-E technique reduces the Rio Piedras dataset by 309 instances. However, IR-E decreased the classifiers' precision by 27.7 %, indicating that important dataset instances were removed and, with them, important information was lost.

4.3 Attributes and Instances Reduction

In the same context, Table 5 shows the results of training the classifiers with the dataset obtained after jointly applying the reduction techniques presented above. As noted, the sequential composition of attribute reduction followed by instance reduction is denoted ASPC + IR-E, while the sequential combination of instance reduction followed by attribute reduction is denoted IR-E + ASPC.

Table 5. Attribute and instance reduction methods in conjunction results

It is interesting that the IR-E + ASPC sequence reduced both attributes and instances in greater quantity than the ASPC + IR-E technique. The first reduced 15 components and 420 instances, while the second reduced 13 components and 152 instances. However, when a classifier is trained with the dataset obtained through ASPC + IR-E, a higher precision (73.9 %) is obtained than when training with the dataset obtained through IR-E + ASPC (54.3 %). In addition, the dataset reduced by IR-E + ASPC is represented by three features that explain only 41.5 % of the total variance, which falls outside the confidence range suggested by the B4 criterion. This means that this method is not appropriate for reducing the size of the Rio Las Piedras dataset.

As a result, we can see that the ASPC + IR-E mechanism is a suitable solution for reducing the water quality dataset size, allowing the classifiers to provide a precision similar to that obtained with the original (unprocessed) dataset. Accordingly, the IR-E + ASPC method is not appropriate for reducing the dataset size.

4.4 Average Classifiers Training Time

It is noteworthy that another way to evaluate the algorithms' performance (efficiency) is the reduction of classifier runtime when processing the dataset. This measure is important when the task of training a classifier for detecting water quality involves a highly complex scenario (large amounts of data) or a decision support system that must define contingency actions in relatively short times.

For this analysis, the arithmetic average of the time each classifier takes to process a dataset is calculated. For this case study, the datasets are the original (unprocessed) Rio Las Piedras dataset and the processed (reduced) Rio Las Piedras dataset. These results are summarized in Fig. 4.

Fig. 4. Average classifiers training time
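A simple way to reproduce this measurement is to time each classifier's training (fit) on a given dataset and average the results, as sketched below with illustrative classifier choices:

```python
# Illustrative sketch: average training (fit) time, in milliseconds, of several
# classifiers on one dataset, used to compare the original and reduced datasets.
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def average_training_time_ms(X, y, repeats=10):
    models = [SVC(), GaussianNB(), KNeighborsClassifier(),
              DecisionTreeClassifier(), MLPClassifier(max_iter=1000)]
    times = []
    for model in models:
        for _ in range(repeats):                          # repeat to smooth out timer noise
            start = time.perf_counter()
            model.fit(X, y)
            times.append((time.perf_counter() - start) * 1000.0)
    return float(np.mean(times))                          # mean fit time in milliseconds
```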

First, it is observed that the ASPC technique, in addition to reducing the number of dataset attributes (by 13), reduces the classifiers' training time by 228 ms compared with the training time on the original dataset. This indicates that the existence of redundant features in the dataset influences the classifiers' performance.

Following the same behavior, the IR-E technique reduces the dataset by 309 instances and the classifiers' training time by 226 ms. This result suggests that by minimizing the amount of redundant instances and noise, it is possible to reduce the time necessary to train the classifiers (computational cost).

Now, in the case of the combined reduction methods ASPC + IR-E and IR-E + ASPC applied to the Rio Las Piedras dataset, the classifier training time is greatly reduced in both cases (244 ms and 257 ms, respectively). Additionally, the ASPC + IR-E approach allows a classification process significantly faster than the individual methods and even than the combined method IR-E + ASPC.

These results are to be expected, because such mechanisms remove both kinds of dataset elements (instances and attributes), which represents a reduction of computational cost in data mining tasks (classification and/or clustering).

As a result, the proposed strategy obtains precision values similar to those of the original dataset and greatly reduces the classifiers' training time, making this proposal the most appropriate for reducing the dataset size for water quality detection.

4.5 Class Balance

To evaluate the proposed mechanism, 10-fold cross-validation was used with the classifiers SVM, BN, K-NN, C.4.5 (DT), and ANN. Under this approach, an imbalanced version of the Rio Las Piedras dataset (Table 6) and datasets over-sampled with 50 %, 100 %, 150 %, and 200 % of synthetic instances (Table 7) were evaluated. As indicated above, the majority and minority classes are denoted as C− and C+, respectively.

Table 6. Imbalanced dataset description
Table 7. Classifiers performance behavior over C− and C+ classes

The performance of each of these classifiers is measured in terms of the ROC (Receiver Operating Characteristic) and F-Measure metrics (in %), used to evaluate both the original dataset and the datasets over-sampled with different amounts of synthetic instances (50 %, 100 %, 150 %, and 200 %). The results are shown in Table 7.
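For the binary C−/C+ setting, per-class F-Measure and ROC-area values of this kind can be computed as sketched below; the decision tree is an illustrative stand-in for the classifiers in Table 7, and X and y are assumed to contain only the two classes (labels 1 and 3):

```python
# Sketch: per-class F-Measure and ROC area from 10-fold cross-validated
# predictions, assuming a binary problem with labels 1 (C-) and 3 (C+).
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def per_class_report(X, y, neg=1, pos=3, cv=10):
    clf = DecisionTreeClassifier(random_state=0)
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
    f_neg, f_pos = f1_score(y, y_pred, labels=[neg, pos], average=None)
    return {'F-M C-': f_neg, 'F-M C+': f_pos,
            'ROC C+': roc_auc_score(y, proba)}            # AUC w.r.t. the C+ scores
```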

As indicated in the table above, when training the C.4.5 classifier with the original Rio Las Piedras dataset ($IL = 5.6$), instances belonging to the negative class tend to be correctly classified (F-M = 88 % and ROC = 75.6 %), while those belonging to the positive class tend to be classified incorrectly (F-M = 0 and ROC = 60.8 %). This is because the classification methods tend to favor the negative class.

Similarly, as the number of positive class instances is increased with synthetic data, the classification performance for that class improves, with the best results obtained when synthetic data are generated at 150 % and 200 % (F-M = 59.5 %, ROC = 79.8 % and F-M = 60 %, ROC = 81 %, respectively). Although there is no significant difference between these results, and considering that the goal of the class balancing process is for the classifier to correctly predict C+ instances without significantly affecting the detection of C− instances, the best classification results are obtained when C+ is balanced with 150 % of synthetic instances. This oversampling level improves the F-Measure and ROC metrics for the C+ class to 59.5 % and by 19 %, respectively, with little effect on C− class detection: it reduces the F-Measure by only 3.5 % and increases the ROC value by 6.4 %.

Now, the K-NN measures shown in Table 7 clearly expose the same behavior as the previous classifier (C.4.5): when the C+ class is oversampled with 150 % of synthetic instances, the classifier discriminates and detects it better (the F-Measure and ROC metrics increase their performance to 38.7 % and 15 %, respectively), at the expense of a 6 % reduction in precision and a 3.3 % reduction in ROC area for the C− class. Meanwhile, when the C+ class is oversampled at a level of 200 %, the F-Measure and ROC metrics improve by 5.3 % and 1.2 % with respect to the previous process, while further reducing the performance of C− classification.

On the other hand, the SVM classifier obtained an F-Measure of 0 for the C+ class in every case. The opposite occurs for the C− class, where good results are obtained. From this, we can say that the SVM classifier is very sensitive to the class imbalance problem, because this algorithm places its separating hyperplane very close to the minority class, resulting in low or null performance for examples of this class compared to those of the majority class [54].

Like BN, the ANN finds it more difficult to gain in the metrics when the dataset is oversampled with less than 150 % of synthetic instances. In fact, as in the previous cases, the best results are obtained when the dataset is oversampled with this value (150 %), which can be seen as the best balance between the number of true and false positives.

From these results, we can see that, despite balancing the dataset through the SMOTE technique, the percentage of instances correctly classified by the SVM, BN, and ANN algorithms does not increase significantly, and is even lower than the results obtained when the classifiers are trained directly with the imbalanced (original) dataset. Therefore, using synthetic samples generated through SMOTE represents a gain in class discrimination capacity, but not equally across all classifiers.

In this vein, the experimental results show that oversampling the imbalanced dataset at levels of 150 % and 200 % improves the classifiers' detection of the C+ class without significantly impairing detection of the C− class. In addition, it is observed that either oversampling level yields approximately the same results. However, creating too many synthetic instances produces overtraining, which affects the ability to detect the minority class.

From the foregoing, the best synthetic instance oversampling level is considered to be 150 %, and the SMOTE algorithm is regarded as a practical method for generating synthetic instances. However, special care must be taken with the number of generated instances, as it can lead to poor results in C− class prediction.

Based on the considerations presented, the balanced dataset (with 150 % of synthetic instances) is described in Table 8.

Table 8. Overview of processed dataset

As shown in the table above, oversampling the dataset with 150 % of synthetic instances increases the number of examples in the Rio Las Piedras dataset by 54.

4.6 Classification Module

This section presents the evaluation of the classifiers (DT, ANN, BN, K-NN, SVM) and the analysis of results, applied to the dataset described in Table 1 and to the processed dataset obtained after applying the dimensionality reduction and class balancing approach (Table 8).

As mentioned above, the performance of each of these classifiers is obtained in terms of metrics such as precision, recall, F-Measure, and ROC, used to evaluate both the original and the processed dataset (proposed mechanism). The results are presented in Fig. 5.

Fig. 5. Results of classifier evaluation (SVM, ANN, K-NN, BN, and C.4.5) applied to the original and processed Rio Las Piedras datasets

In the case of the original dataset, the ANN and C.4.5 algorithms obtained the best results among all the evaluated classifiers, incorrectly classifying a smaller proportion of instances compared to the other methods, which can be verified by the obtained precision (C.4.5 = 83 % and ANN = 76 %, compared to SVM = 62.6 %, K-NN = 70.1 %, and BN = 65 %).

Additionally, it is important to mention that the C.4.5 classifier obtains a higher proportion of true positives than the other techniques because it has a higher recall (83 %), while the other algorithms (SVM, BN, and K-NN) obtained recall values of no more than 70 %. Regarding the F-Measure, C.4.5 and ANN followed the same behavior, obtaining the best values (83.3 % and 76.3 %, respectively) among all the evaluated techniques. These results suggest that the C.4.5 and ANN supervised algorithms behave best when working with this dataset.

Similarly, the five supervised learning algorithms are trained with the processed dataset (Table 8), as shown in Fig. 5. This graph shows that, for all classification models, the number of incorrectly classified instances was reduced, except for C.4.5, where it remained approximately constant. This behavior is reflected in the increased precision of the classifiers: BN (72.7 %), SVM (73.8 %), and K-NN (72.7 %), with the exception of ANN, which failed to increase its precision (74.2 %).

However, the proportion of true positives relative to false positives is high, since the recall values exceed 72 % (SVM = 82.2 %, ANN = 81.7 %, BN = 81.3 %, K-NN = 72.4 %, and C.4.5 = 83.6 %). These results allow us to identify that the five evaluated models offer good confidence, since the number of false positives is low and the number of relevant instances classified is high, as can be verified through the F-Measure values (SVM = 77.3 %, ANN = 77.2 %, BN = 76.6 %, K-NN = 72.6 %, and C.4.5 = 78.8 %).

5 Conclusions and Future Works

In the previous sections, mechanisms for dataset dimensionality reduction and water quality detection were proposed. From the results achieved, it can be inferred that the proposed mechanism (ASPC + IR-E) is an appropriate solution for pre-processing water quality datasets in order to reduce their size, allowing classifiers to provide a precision similar to that obtained with the original (unprocessed) dataset. Similarly, the execution times of the classifier training tasks were reduced.

Additionally, the evaluated classification algorithms present good results. However, the ANN and C.4.5 algorithms obtained the highest precision values, misclassifying fewer instances and keeping the same behavior across all experiments. Therefore, these supervised learning models are chosen as the base classifiers to detect water quality in the Rio Piedras Basin, taking into account the dataset characteristics associated with this domain.

As future work, it is intended to deploy the proposed mechanisms in a production environment; address other dataset problems, such as missing values and outliers; and use a similar methodology to generate a water quality prediction, in order to get the ability to construct an early warning system for lotic ecosystems.