1 Introduction

Many factors contribute to the formation of traffic bottlenecks. One is the very composition of traffic (Wei Qu et al. 2012): a lane occupied by four trucks in a row does not have the same capacity as one occupied only by cars, since heavy vehicles generally travel more slowly and generate longer queues of vehicles. Another factor that negatively affects traffic is the interaction between cars. When traffic flow exceeds 1700 vehicles per hour, any overtaking or maneuvering reduces the overall speed of the road (Hernández et al. 2002). Branch roads and slip roads may also aggravate congestion, as they feed extra vehicles into a road that is, in itself, already saturated.

Decreasing speed helps to prevent traffic jams because it allows the flow to move more evenly: if all vehicles travel along a road at a low speed, the capacity of the roadway does not saturate so quickly, and the mass of cars moves slowly but steadily. Traffic congestion is a very complex problem and, possibly, the most feasible solution would be to build more and better infrastructure. The drawback is that, as more roads and highways are built, the number of cars also increases, and eventually the same traffic issues arise. Different solutions have been attempted in cities around the world, with varying degrees of success (Carlson et al. 2010).

Some highways have been widened with new lanes, either by using the hard shoulder or by narrowing the existing lanes where feasible. These adjustments are both expensive and time consuming and, since their effectiveness depends on population growth and on vehicle sales, they only provide a medium-term solution (Ahmane et al. 2013).

A less costly solution is to study how traffic behaves using information obtained from roadside sensors (Dunkel et al. 2011) located at particular segments of the road network, which record variables such as congestion, intensity, occupancy, or average vehicle speed (Lee et al. 2010). The model obtained for traffic control varies depending on the area of study.

Traffic congestion in the Spanish city of Seville, and in particular on the Fifth Centenary bridge, is a latent problem. The huge volume of vehicles transiting in either direction varies across different time slots. Northbound and southbound traffic are separated by a moveable barrier, which allows the number of lanes allocated to each direction to be changed in order to accommodate heavy demand in either direction, respond to unusual traffic conditions, and balance lane availability with real-time traffic demand. Once a vehicle enters the reversible lane, there is no alternative path until its end. Therefore, the proper management of this lane, and even the possibility of providing traffic alternatives at peak times, becomes an essential task.

This paper presents a general-purpose methodology devoted to predicting vehicle congestion on roads, applied to the spot mentioned above. In particular, ensembles of well-known machine learning algorithms are proposed. Four different congestion scenarios have been analyzed, in which traffic retention is anticipated by 15, 30, 45, and 60 min. In these scenarios, the results achieved, on average, an accuracy greater than 85 %. It is noteworthy that this paper is an extension of the work previously published in Florido et al. (2015), in which single soft computing methods were applied.

The rest of the paper is organized as follows. Section 2 reviews the most recent and relevant works related to traffic congestion prediction. Section 3 describes the proposed methodology to predict traffic congestion. The achieved results are presented and discussed in Sect. 4. Finally, the conclusions of this paper are drawn in Sect. 5.

2 Related works

Urban traffic-flow prediction is a task of utmost relevance to decrease traffic congestion in large-scale urban areas. For this reason, many prediction methods have been proposed and measured according to different metrics. In this sense, an interesting review that explores different methods and different metrics, with a short critique of measurement criteria, can be found in Rao and Rao (2012).

The prediction of traffic flow on a highway in the St. Louis metropolitan area was addressed in Amin et al. (1998). The authors developed models based on radial basis function neural networks using historical data retrieved from nine different sensors. The results achieved outperformed those of ARIMA.

The work in Pescaru (2013) proposed a technique for congestion prediction in urban traffic. In particular, it used event-based route selection and relied on information collected by a sensor network. Simulation experiments with more than 50 traffic patterns over eight crowded intersections demonstrated promising results.

Two urban traffic prediction models using different modeling approaches were introduced in Liang and Wakahara (2014). The first model was based on the traffic-flow propagation in the network, while the second one was based on the time-varying spare flow capacity of the concerned road links. Predictive errors were reduced by up to 22 % when compared with the existing shift model.

The application of neuro-fuzzy techniques to predict the traffic volume of vehicles on a busy traffic corridor was established in Ogunwolu et al. (2014). Using the city of Lagos as a case study, a traffic prediction system was designed based on acquired real-time traffic data. The network was trained and the fuzzifier module categorized the numerical output of the model. Information on estimated time and fuel consumption cost is also provided.

A deep architecture consisting of two parts, i.e., a deep belief network (DBN) at the bottom and a multitask regression layer at the top, was presented in Huang et al. (2014). The DBN was employed for unsupervised feature learning, showing that it can learn effective features for traffic-flow prediction in an unsupervised fashion. Extensive experiments showed that the proposed model achieved close to 5 % improvement over other approaches found in the literature.

A study attempting to extend deep learning theory to large-scale transportation network analysis can be found in Ma et al. (2015). A deep restricted Boltzmann machine and a recurrent neural network architecture were utilized to model and predict traffic congestion evolution based on global positioning system (GPS) data from taxis. Results from a Chinese city showed that the prediction accuracy can reach 88 %.

An approach to efficiently predict urban traffic congestion using floating car trajectory data was proposed in Kong et al. (2016). The authors used a new fuzzy comprehensive evaluation method, in which the weights of multiple indexes are assigned according to the traffic flows. The prediction was made using a particle swarm optimization algorithm, which calculates several traffic-flow parameters. The results achieved outperformed other methods in terms of accuracy, instantaneity, and stability.

An approach based on chaos theory for predicting motorized traffic flow on urban road networks was introduced in Adewumi et al. (2016). Nonlinear time series modeling techniques were used, with emphasis on the largest Lyapunov exponent, to aid in the prediction of traffic flow. Although the results are promising, the authors did not perform any comparative study to evaluate their performance.

The paper in Li et al. (2016) proposed an adaptive data-driven real-time congestion prediction method. This framework included a traffic pattern recognition algorithm based on adaptive K-means clustering, a two-dimensional speed prediction model, and an adaptive threshold calibration method. After principal component analysis, the adaptive K-means clustering algorithm is applied to obtain different traffic patterns. Congestion recognition is realized with an adaptive threshold calibration method, and congestion prediction is then issued according to the different traffic patterns. Parameter calibration and model evaluation were carried out using floating car travel speed data. The results showed better real-time performance, accuracy, and robustness than the ARIMA model and the Kalman filtering method.

Finally, a very recent survey of current urban traffic management schemes for priority-based signalling and for reducing congestion and the average waiting time of vehicles can be found in Nellore and Hancke (2016). The main objective of this survey is to provide a taxonomy of the different traffic management schemes used to avoid congestion. Existing urban traffic management schemes for avoiding congestion and giving priority to emergency vehicles are reviewed, setting the foundation for further research.

3 Methodology

This section describes the proposed methodology. First, the information provided by the stations of data collection is described in Sect. 3.1. Later, in Sect. 3.2, the base machine learning classifiers are listed along with a brief description of them. The procedure to create ensembles is detailed in Sect. 3.3. Finally, the validation process followed to assess the methodology performance is summarized in Sect. 3.4.

3.1 Information provided by the stations of data collection

The stations of data collection use a set of sensors installed in the roadway to detect the passage of vehicles. The traffic variables obtained from all vehicles registered during the integration period (IP), which in this case is set to 1 min, are summarized in Table 1.

Table 1 Description of the variables retrieved from the stations of data collection

Finally, information about the future occurrence of traffic congestion is included in the class. This class, which is binary and only accepts 0 and 1 as possible values, is generated in a similar way as in Asencio-Cortés et al. (2015, 2016). That is, if congestion is to happen within the following preset minutes (in this study, 15, 30, 45, or 60 min, as described in Sect. 4.3), then a 1 is assigned to the class; otherwise, a 0 is assigned. The assignment of such values is determined by the HIOCC algorithm, developed by the Transport and Road Research Lab (TRRL) (Collins 1993), which is implemented in the station of data collection itself and applied over the integration period. It is expressed as YES \(=\) 1 or NO \(=\) 0. This variable will be the class label and, therefore, the one to be correctly predicted.

Feature selection in the field of urban traffic congestion prediction is a relevant issue (Yang 2013). For this reason, several variables described in the previous section were not considered. In particular, the variables Kamikaze alarm, reverse direction, average length, composition length, composition speed, and error have been discarded due to their null relevance to this analysis. Therefore, four input variables were used to train the models: intensity, occupation, average speed, and average distance.
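As an illustration only (the authors' experiments were carried out in Weka; the column names and the pandas-based layout below are assumptions made for this sketch), the construction of the supervised data set for a given prediction horizon can be outlined as follows, aligning each 1-min record with the congestion flag observed the corresponding number of minutes later:

```python
import pandas as pd

def build_dataset(df: pd.DataFrame, horizon_min: int) -> pd.DataFrame:
    """Build a supervised data set for a given prediction horizon.

    df is assumed to hold one row per 1-min integration period (no gaps),
    with the four input variables plus a binary 'congestion' flag (HIOCC output).
    """
    features = ["intensity", "occupation", "avg_speed", "avg_distance"]
    out = df[features].copy()
    # The class for time t is the congestion flag observed horizon_min minutes
    # later, so the model learns to anticipate congestion that far ahead.
    out["class"] = df["congestion"].shift(-horizon_min)
    return out.dropna(subset=["class"])

# One data set per scenario: 15, 30, 45, and 60 min ahead.
# datasets = {h: build_dataset(raw_df, h) for h in (15, 30, 45, 60)}
```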

3.2 Base machine learning algorithms

A set of seven machine learning algorithms has been selected to evaluate their effectiveness in traffic congestion prediction: nearest neighbors (K-NN), C4.5 decision trees (C4.5), artificial neural networks (ANN), stochastic gradient descent optimization (SGD), a fuzzy unordered rule induction algorithm (FURIA), Bayesian networks (BN), and support vector machines (SVM). All seven algorithms are able to address a classification problem; therefore, they are referred to as classifiers in this work. Specifically, the data sets used to validate the performance of traffic congestion prediction have two classes (binary classification).

Since easily interpretable results are desired, the capability of a machine learning algorithm to produce a viewable knowledge model is also a relevant property when selecting the most suitable one for this problem. Three out of the seven classifiers generate an interpretable model: C4.5 (trees), FURIA (rules), and BN (graphs).

A short description of each of the seven algorithms studied as base classifiers is given below. First, the K-NN algorithm (Aha et al. 1991) belongs to the case-based reasoning family of machine learning and is based on the proximity between the sample to be classified and the samples that form the training set. It is a deterministic algorithm whose critical point is the choice of the number of neighbors to use. The most common strategies simply choose the closest instance (the training sample at the shortest distance from the sample to be labeled) or take an odd number of neighbors to avoid ties between their classes when the classification is binary.

The C4.5 algorithm (Quinlan 1993; Salzberg 1994) is a method for generating a decision tree. It selects an attribute as the root of the tree and creates a branch for each possible value of that attribute. For each resulting branch (new node of the tree), the same process is repeated: another attribute is selected and a new branch is generated for each of its possible values. This procedure continues until the samples are classified through one of the paths of the tree. The end node of each path is a leaf node, to which the corresponding class is assigned.

Therefore, the objective of decision trees is to obtain rules or relationships that allow classification based on the attributes. The algorithm relies on the concept of information gain, and it can build decision trees when some samples have unknown values for some attributes, handle attributes with continuous values, prune the decision tree, and derive classification rules.

ANNs (McCulloch and Pitts 1943) are a family of statistical learning algorithms inspired by biological neural networks that are used to estimate or approximate functions that may depend on a large number of inputs and are generally unknown. ANNs are generally presented as systems of interconnected neurons that can calculate values from inputs and are capable of machine learning and pattern recognition thanks to their adaptive nature.

SGD (Duchi et al. 2011) is an optimization technique that can be used for classification purposes by learning various linear models (SVM with linear kernel was used according to the binary class of the data sets). For SVM, the error function (to be minimized) is continuous and it is called the hinge loss. The SGD procedure is computationally simple and converges fast, allowing models, such as linear SVM or logistic regression, to be learned from high-dimensional data sets.

FURIA (Hühn and Hüllermeier 2009) is based on the rule learner RIPPER algorithm (Cohen 1995), but it obtains fuzzy rules. The rules are put into unordered sets and it makes use of an efficient rule stretching to deal with uncovered instances. To obtain fuzzy rules, FURIA generates the conventional rules from a modified RIPPER, and then, it searches for the best fuzzy extension of each rule, where a fuzzy extension is a rule of the same structure, but with intervals replaced by fuzzy intervals. The fuzzification is then realized for the antecedent with the largest purity. This process is repeated until all antecedents have been fuzzified.

BN (Friedman et al. 1997) are probabilistic machine learning algorithms that construct networks based on Bayes' theorem. The algorithm is based on two components: a function for evaluating a given network based on the data and a method for searching through the space of possible networks. The quality of a network is measured by the probability of the data given that network: the probability that the network assigns to each instance is calculated, and these probabilities are multiplied together over all the instances. The BN algorithm relies on conditional independence assumptions among the attributes and was designed for classification tasks.

SVM (Schölkopf and Smola 2002) is a binary classifier for numeric data that uses separating hyperplanes as the decision boundary between the two classes. The optimization problem of determining these hyperplanes is set up with the notion of margin. A maximum margin hyperplane is one that cleanly separates the two classes and for which a large region exists on each side of the boundary with no training data points in it. For linearly separable data, it is possible to construct a linear hyperplane that cleanly separates the data points belonging to the two classes.

The implementations of the algorithms in this study, as well as their parameter setups, are those included in the free data mining tool Weka (Hall et al. 2009).

Once the algorithms have been presented, the particular configuration used in the experimentation is now described. For K-NN, the number of neighbors (K) was set to 1 and the Euclidean distance was used without any type of distance weighting. In the C4.5 algorithm, the confidence factor for pruning was set to 0.25 and the minimum number of instances covered by the leaf nodes was 2. A minimum description length correction was used when finding splits on the numeric attributes to avoid large and, therefore, less interpretable tree models.
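For reference, these two configurations can be approximated outside Weka. The following scikit-learn sketch is not the implementation used in the paper (and scikit-learn's CART trees are only an approximation of C4.5), but it mirrors the 1-NN and pruned-tree settings described above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 1-NN with plain Euclidean distance and no distance weighting.
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean", weights="uniform")

# CART approximation of the pruned C4.5 tree: at least 2 instances per leaf,
# entropy-based splits playing the role of the information-gain criterion.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2)
```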

The ANN algorithm was configured as follows. A multilayer perceptron using the backpropagation technique was used. The network had three layers: an input layer, a hidden layer, and an output layer. According to the number of input attributes, there were four neurons in the input layer. The hidden layer had three neurons. Finally, the output layer had two neurons, corresponding to the two possible classes (traffic congestion or non-congestion). The learning rate, which determines the step size when searching for a solution and hence how quickly the search converges, was set to 0.3.
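A hedged scikit-learn approximation of this 4-3-2 network (not the Weka MultilayerPerceptron used by the authors; the number of training epochs below is an assumption) could be:

```python
from sklearn.neural_network import MLPClassifier

# 4 input attributes -> 3 hidden neurons -> 2 output classes, trained with
# stochastic gradient descent; learning_rate_init plays the role of the
# 0.3 step size used in the backpropagation setup described above.
ann = MLPClassifier(hidden_layer_sizes=(3,), solver="sgd",
                    learning_rate_init=0.3, max_iter=500)
```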

In the SGD algorithm, the loss function was set to the hinge loss (linear SVM), the number of iterations was set to 500, the regularization constant to 0.0001, and the learning rate was set to 0.01.
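An approximate scikit-learn counterpart of this configuration (again, not the Weka implementation used in the paper) would be:

```python
from sklearn.linear_model import SGDClassifier

# Linear SVM learned by stochastic gradient descent: hinge loss, 500 passes
# over the data, L2 regularization constant 0.0001, constant learning rate 0.01.
sgd = SGDClassifier(loss="hinge", max_iter=500, alpha=0.0001,
                    learning_rate="constant", eta0=0.01)
```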

In the FURIA algorithm, the T-norm of the fuzzy part was established as the product T-norm, three folds were used for pruning the rules, the minimum total weight of instances covered by a rule was 2, and the number of optimization runs was set to 2.

For BN, a genetic search with tournament selection was used as the search algorithm, and a simple estimator (with \(\alpha =0.5\)) was used for estimating the conditional probability tables of the Bayes network.

The SVM implementation used is the one included in the free LibSVM package (Chang and Lin 2011), and it was configured as follows. The SVM type was C-SVC for classification tasks, the kernel was polynomial of degree 1, and the cost parameter C was set to 1.
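Since scikit-learn's SVC also wraps LibSVM, a roughly equivalent configuration can be sketched as follows (the probability option is added here only to allow threshold-based post-processing and is not part of the original setup):

```python
from sklearn.svm import SVC

# C-SVC with a degree-1 polynomial kernel and cost C = 1, mirroring the
# LibSVM configuration (scikit-learn's SVC wraps LibSVM internally).
svm = SVC(kernel="poly", degree=1, C=1.0, probability=True)
```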

3.3 Ensemble algorithms

Ensemble learning is a machine learning paradigm motivated by the fact that different classifiers may make different predictions due to their specific characteristics or their sensitivity to random variations in the training data. An ensemble algorithm is an approach to increase prediction accuracy by combining the results from multiple classifiers. The basic approach of ensemble analysis is to apply the aggregated (internal) classifiers multiple times, using either different models or the same model on different subsets of the training data. The results from the different classifiers are then combined into a single robust prediction.

Ensemble algorithms intend to improve the accuracy of a classifier by reducing its bias and variance. Some ensemble methods, such as bagging and Random Forests (RF), are designed only to reduce the variance, whereas others, such as boosting and stacking, can help reduce both the bias and the variance. In some cases, such as boosting, it is possible for the ensemble to overfit the noise in the data and thereby lead to lower accuracy.

To select a set of representative ensemble learning algorithms, three classic approaches were considered for the comparative study: bagging, boosting, and stacking. Moreover, a meta-learning-based algorithm that selects the best probability threshold to classify binary classes, named probability threshold selector (PTS), was also included in the group of ensemble algorithms studied in this work. The objective is to check whether some base classifiers can be improved by ensemble techniques in the traffic congestion prediction problem.

The bagging technique (Breiman 1996; Kuncheva 2004) is based on a simple voting of the predictions of a set of classifiers. The procedure is, in general terms, the following. The training set is divided into several training subsets using bootstrap re-sampling, each with the same number of examples. Next, a classifier is trained from each subset and, given a test set, a prediction is requested from each classifier. The final class is computed by aggregating the predictions of all classifiers; such aggregation generally consists of a simple voting scheme (e.g., the majority predicted class).

Although the bagging technique generally achieves better results than the base classifiers, these are aggregated independently (each one is trained with random samples). For this reason, the performance of bagging strongly depends on the ability of the classifiers to complement one another, that is, on the individual errors of a classifier being corrected by the others. Therefore, combining multiple classifiers can only improve effectiveness when their models are significantly different from one another and each one correctly classifies a reasonable percentage of the data set.

The baseline configuration for the bagging algorithm assumes the same internal classifier throughout the voting assembly. In this work, the bagging technique was applied using both the same internal classifier and different ones. Each of the seven base classifiers described in Sect. 3.2 was used as the internal classifier for the bagging algorithm. In addition, the Random Forests (Breiman 2001) algorithm, which is a bagging of decision trees, was also included in the comparison study.
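A minimal sketch of this setup, assuming scikit-learn rather than the Weka implementations used by the authors, is given below; any of the seven base classifiers can be plugged in as the internal classifier (a decision tree is used here only as an example):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: 10 bootstrap replicates of the training set, one internal classifier
# per replicate, majority vote at prediction time.  (In scikit-learn releases
# before 1.2 the parameter is named base_estimator instead of estimator.)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)

# Random Forests: a bagging of randomized decision trees.
rf = RandomForestClassifier(n_estimators=10)
```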

Fig. 1 Methodology applied to validate the seven base classifiers

The boosting approach iteratively seeks models that complement each other well, covering the maximum amount of data with correct predictions. Unlike the bagging technique, in the boosting procedure each new model depends on the performance of those built previously. The instances incorrectly predicted in each iteration are assigned a greater weight and importance so that they are predicted correctly in further iterations.

The selected boosting algorithm was AdaBoost M1 (Freund and Schapire 1996). This algorithm begins by assigning equal weight to all instances in the training data. Next, the internal classifier is trained with these instances, which are then reweighted according to the classifier output: the weight of correctly classified instances is decreased, and the weight of misclassified ones is increased. In further iterations, classifiers are trained with the reweighted data and consequently focus on classifying the hard instances correctly.
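A minimal sketch of this boosting setup, again with scikit-learn as a stand-in for the Weka AdaBoostM1 implementation, is the following (the depth-1 tree used as internal classifier is only an illustrative choice):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# AdaBoost: each of the 10 iterations re-weights the training instances so that
# the next internal classifier concentrates on previously misclassified (hard)
# examples; the final prediction is a weighted vote of all iterations.
# (In scikit-learn releases before 1.2 the parameter is named base_estimator.)
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=10)
```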

In addition to the algorithm AdaBoost M1, a Logistic Model Trees (LMT) learning algorithm (Landwehr et al. 2005) was analyzed and included in the comparison study, providing a method that produces an interpretable knowledge model based on trees. LMT is based on the LogitBoost algorithm (Friedman et al. 2000), and it induces trees with linear-logistic regression models at the leaves.

The stacking class of ensembles arises from the idea of non-homogeneous boosting, which takes the outputs of a group of classifiers and produces the final classification by weighted majority voting. In this procedure, two layers are involved: the base classifiers at the lower level and a master classifier combining their outputs at the upper level. In non-homogeneous boosting, a linear classifier is used at the upper level. This architecture is the basis of the stacking approach.

The stacking technique takes a set of diverse classifiers used at the lower level. At the upper level, instead of using only a linear classifier, any general classifier can be used. In fact, both the base classifiers and the master may come from the most diverse paradigms.

The stacking algorithm used in this study was the one published in Wolpert (1992). Each of the seven base classifiers described in Sect. 3.2 was tested as the master classifier for the stacking algorithm. For each master classifier, the six remaining classifiers act as lower-level classifiers in the stacking.
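The scheme can be sketched with scikit-learn's StackingClassifier, which follows Wolpert's formulation; only an illustrative subset of lower-level classifiers is shown here, whereas in the actual experiments each of the seven classifiers plays the master role in turn with the remaining six at the lower level:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Lower level: a set of diverse base classifiers; upper level: a master
# classifier trained on their cross-validated outputs (cv=10 folds).
lower = [("knn", KNeighborsClassifier(n_neighbors=1)),
         ("tree", DecisionTreeClassifier()),
         ("nb", GaussianNB())]
stack = StackingClassifier(estimators=lower, final_estimator=SVC(), cv=10)
```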

Fig. 2 Methodology applied to validate the bagging and boosting algorithms

Fig. 3 Methodology applied to validate the stacking algorithm

Finally, a meta-learning-based algorithm named PTS was also included in the study. PTS selects a mid-point threshold on the probability output by a base classifier. The threshold is set so that a given performance measure is optimized; the F-measure was used for this purpose. Performance is measured using a tenfold cross-validation over the training data. In addition, the probabilities returned by the base learner can have their range expanded, so that the output probabilities lie between 0 and 1. Each of the seven base classifiers described in Sect. 3.2 was tested as the base classifier for the PTS algorithm. The PTS implementation used was the one included in the Weka external package named ThresholdSelector.
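The threshold selection performed by PTS can be approximated with the following sketch; this is not Weka's ThresholdSelector, only an illustration of the idea, in which the probability output of the base classifier is estimated by an internal tenfold cross-validation and the cut-off maximizing the F-measure is retained:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

def select_threshold(base_clf, X, y, n_folds=10):
    """Pick the probability cut-off that maximizes the F-measure,
    estimated by an internal cross-validation over the training data."""
    proba = cross_val_predict(base_clf, X, y, cv=n_folds,
                              method="predict_proba")[:, 1]
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y, (proba >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]
```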

3.4 Validation process

To validate the effectiveness of the classifiers, a tenfold cross-validation was performed three times for each of the four data sets of this study (60, 45, 30, and 15 min). This is a classic procedure to validate classifiers in machine learning and consists of the following steps. First, each data set is divided into training and test subsets. Second, the training set is used to train the classifier and to produce its knowledge model. Then, such a model is used to predict the instances of the test set and, finally, the resulting predictions are compared with the actual test classes to produce the performance metrics.
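In scikit-learn terms, this validation scheme could be sketched as follows (whether the folds were stratified in the original experiments is an assumption of this sketch):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.metrics import matthews_corrcoef, make_scorer

# Three repetitions of a tenfold cross-validation, yielding the 30 per-fold
# values per classifier and data set later used in the statistical tests.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
scoring = {"mcc": make_scorer(matthews_corrcoef), "auc": "roc_auc",
           "precision": "precision", "recall": "recall"}
# results = cross_validate(classifier, X, y, cv=cv, scoring=scoring)
```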

The process of training a classifier may involve different steps depending on its type. The base classifiers studied in this work follow the simple training process described above, as can be seen in Fig. 1. However, ensemble methods generally involve an iterative subprocess to train the internal classifiers on which their predictions are based. Specifically, the algorithms belonging to the bagging and boosting ensemble approaches perform a given number of iterations over their internal classifiers, as shown in Fig. 2. In particular, the bagging and boosting algorithms were trained with each of the seven base classifiers described in Sect. 3.2, that is, one experiment for each of them. During the ensemble training process, ten iterations were performed.

The stacking and PTS algorithms perform an internal cross-validation (using ten folds) to train the base classifiers and to tune the consensus of their predictions. This process is shown in Figs. 3 and 4.

Fig. 4 Methodology applied to validate the PTS algorithm

4 Experiments

In this section, the data sets used to train and test the classifiers are explained. Next, the parameters used to evaluate the quality of the results are presented. Finally, the results produced in the experimentation are shown and discussed.

4.1 Data sets description

For this study, data were obtained from the detector during the last quarter of 2011, with a total of 27,321 records used for training and 8926 for testing. The Congestion variable has been taken as the class label attribute, since this attribute takes the values YES \(=\) 1 or NO \(=\) 0. As the data mining techniques have been applied to various time intervals, the class has been shifted to analyze the behavior at 15, 30, 45, and 60 min. That is, for each of the four evaluated scenarios, the class label corresponds to whether or not there will be congestion in the next 15, 30, 45, or 60 min.

4.2 Quality parameters

The quality parameters used to evaluate the different methods are now presented. It was decided to use the most common ones in the literature. On the one hand, a set of indicators that quantify the successes and mistakes of classifiers is calculated:

  1. True positive, TP. The number of times the classifier assigns a 1 to the instance being classified (i.e., predicts the occurrence of a retention) and the retention indeed happens during the next few minutes.

  2. True negative, TN. The number of times a retention was predicted not to occur during the next minutes and, in fact, it does not occur.

  3. False positive, FP. The number of times a retention is erroneously detected for the next minutes, that is, the number of times the classifier assigns a label with value 1 when it really ought to assign a 0.

  4. False negative, FN. The number of times a retention has not been detected, although there was indeed a retention within the next several minutes.

On the other hand, from these four indicators, the quality parameters are properly calculated. In particular:

  1. Recall (also known as sensitivity). It is the proportion of correctly identified retentions, regardless of the value of FP. Mathematically, it is expressed as:

     $$\begin{aligned} S=\frac{\text {TP}}{\text {TP}+\text {FN}}. \end{aligned}$$
     (1)

  2. Area under the curve, AUC. It is defined as the definite integral of the receiver operating characteristic (ROC) curve on the interval [0, 1], where the ROC curve is the graphical representation of the relationship between sensitivity and specificity. Formally:

     $$\begin{aligned} \mathrm{AUC}=\int _0^1 \mathrm{ROC}(t)\, \mathrm{d}t . \end{aligned}$$
     (2)

  3. Precision (also known as positive predictive value, PPV). This value measures the reliability of the TP, that is, the certainty associated with each TP. In other words, it relates TP and FP and is mathematically formulated as:

     $$\begin{aligned} \text {PPV}=\frac{\text {TP}}{\text {TP}+\text {FP}} . \end{aligned}$$
     (3)

  4. Matthews correlation coefficient, MCC. It was proposed by Matthews in 1975 (Matthews 1975) and gives a balanced measure of TP, FP, TN, and FN. Mathematically,

     $$\begin{aligned} \text {MCC}= \frac{\text {TP}\times \text {TN} - \text {FP}\times \text {FN}}{\sqrt{(\text {TN}+\text {FN})(\text {TP}+\text {FP})(\text {TP}+\text {FN})(\text {TN}+\text {FP})}} . \end{aligned}$$
     (4)
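For reference, these four quality parameters map directly onto standard library functions; the following sketch (a hypothetical helper, not part of the original experimentation) computes them from the test labels, the predicted labels, and the predicted congestion probabilities:

```python
from sklearn.metrics import (recall_score, precision_score,
                             roc_auc_score, matthews_corrcoef)

def quality_parameters(y_true, y_pred, y_score):
    """Compute the four quality parameters from the actual test labels,
    the predicted labels, and the predicted congestion probabilities."""
    return {"recall": recall_score(y_true, y_pred),        # Eq. (1)
            "auc": roc_auc_score(y_true, y_score),         # Eq. (2)
            "precision": precision_score(y_true, y_pred),  # Eq. (3)
            "mcc": matthews_corrcoef(y_true, y_pred)}      # Eq. (4)
```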
Table 2 Validation results for the base classifiers

4.3 Results

After the validation process explained in the methodology section, the results for each type of experimentation are discussed in the following subsections. For each experimentation, a non-parametric statistical test was performed to determine whether the differences in effectiveness among the classifiers are statistically significant. The MCC was selected as the metric to compare the effectiveness in the statistical tests.

Specifically, the Kruskal–Wallis and Friedman tests have been performed, as well as a post-hoc test using Mann–Whitney tests with Bonferroni correction. Such analyses have been carried out using the free on-line tool STATService (Parejo Maestre et al. 2012). To have enough data to carry out the statistical tests, 30 values of the MCC were taken for each classifier and data set (one value per fold and execution; tenfolds \(\times \) three executions).
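The statistical procedure can be approximated with SciPy as follows (a sketch only; the original analyses were run with STATService, and the Friedman test assumes the 30 MCC values of all classifiers come from the same folds):

```python
from itertools import combinations
from scipy import stats

def compare_classifiers(mcc_by_clf, alpha=0.05):
    """mcc_by_clf maps each classifier name to its 30 per-fold MCC values."""
    samples = list(mcc_by_clf.values())
    print("Kruskal-Wallis:", stats.kruskal(*samples))
    print("Friedman:", stats.friedmanchisquare(*samples))
    # Post-hoc pairwise Mann-Whitney tests with Bonferroni correction.
    pairs = list(combinations(mcc_by_clf, 2))
    corrected_alpha = alpha / len(pairs)
    for a, b in pairs:
        _, p = stats.mannwhitneyu(mcc_by_clf[a], mcc_by_clf[b])
        print(a, "vs", b, "p =", p, "significant:", p < corrected_alpha)
```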

4.3.1 Validation results for base classifiers

All the results for the base classifiers and the different prediction horizons are shown in Table 2. ANN achieves significantly better effectiveness than the other base classifiers in terms of the AUC measure for all data sets. With respect to the hardest data sets (45, 30, and 15 min), the FURIA algorithm performed significantly better in terms of MCC values. Moreover, FURIA is able to produce an interpretable model based on fuzzy rules.

Therefore, regarding the results achieved by the base classifiers, the best choice in terms of effectiveness and interpretability has been found to be the FURIA algorithm for all data sets. It is also important to highlight the choice of this algorithm because an easy interpretation of the results is highly desired. In this sense, the structure shown by the rules can be immediately interpreted by any professional not related to data mining.

Table 3 Validation results for the bagging-based ensembles
Table 4 Validation results for the boosting-based ensembles

4.3.2 Validation results for ensemble classifiers

The bagging, boosting, stacking, and PTS approaches were applied to the different data sets of study, producing significant improvements with respect to the performance achieved by the base classifiers. However, these improvements are only in the hardest data sets (45, 30, and 15 min). For the 60 min data set, no significant differences with the base classifiers were found.

The results for the bagging ensembles are shown in Table 3. Note that the bagging technique was applied to the seven base classifiers. In addition, the random forests (RF) algorithm is placed in the rightmost column of the table. The results for the boosting ensembles are shown in Table 4. The boosting technique was also applied to the seven base classifiers. Moreover, the logistic model trees (LMT) algorithm is placed at the end of the table. Finally, the results for the stacking and PTS ensembles are shown in Tables 5 and 6.

Specifically, for the 45 min data set, the MCC was significantly improved from 0.32 (achieved by FURIA) to 0.46 (achieved by a bagging over BN). Regarding the 30 min data set, the MCC was significantly improved from 0.48 (achieved by FURIA) to 0.52 (achieved by the PTS technique over BN). For the 15 min data set, the MCC was improved, with statistical significance, from 0.55 (achieved by FURIA) to 0.58 (achieved by the PTS technique over ANN).

It can also be noticed that BN is the classifier whose accuracy improves the most when wrapped by an ensemble technique, especially when a threshold for its classification probability is selected automatically.

Moreover, regarding the precision and recall values obtained with the ensemble techniques, although recall values were generally lower, precision was increased. Such a result is relevant for traffic congestion prediction, where more precise forecasts are desirable even at the cost of lower coverage, due to the high cost derived from false alarms.

Table 5 Validation results for the stacking-based ensembles
Table 6 Validation results for the PTS ensembles

5 Conclusions

The issue of predicting urban traffic congestion has been addressed in this paper. A novel methodology has been proposed, which turns the prediction problem into a binary classification one. Thanks to this transformation, seven well-known base classifiers have been used. In addition, ensembles have been proposed, not only by combining the seven aforementioned methods, but also by including methods that are ensembles themselves. To assess the performance of the approach, four different scenarios have been created: prediction of traffic congestion 15, 30, 45, and 60 min ahead. In particular, data from Seville (Spain) have been analyzed and used for predictive purposes. Several ensembles have achieved successful results, with accuracy (MCC) up to 83 % for the 60 min prediction horizon considered. Nevertheless, all ensembles reached satisfactory results, leading to the conclusion that the methodology designed is, in general, robust for this problem. Future work is directed towards the inclusion of new attributes, as well as towards discovering how urban traffic congestion may affect different spots, that is, how congestion propagates through different routes.