
1 Introduction

Machine Learning (ML) and especially Deep Learning (DL) models have become some of the most effective and useful tools for prediction and inference in different environments such as biomedicine or smart cities. Although data availability is no longer a major concern, class imbalance can degrade model performance. No matter the size of the database, if there are few instances of one of the possible classes to be determined, the algorithm might generalise by classifying almost all instances as part of the majority class.

In many cases, the correct classification of the minority class is of special interest. For instance, in a cancer diagnosis problem the cost of wrongly predicting a patient with cancer as a cancer-free case is critical. In such databases the minority cases generally come from patients who suffer from cancer, and if this imbalance is extreme, models tend to generalise by classifying almost all instances as part of the majority class, while still obtaining a high accuracy. Other examples of such imbalances can be found in fraud detection, where fraudulent cases are by far less frequent.

In traffic event prediction, different factors are responsible for causing traffic delays or accidents, and identifying them in real time is crucial for avoiding uncomfortable situations. In this area, too, the instances corresponding to traffic incidents are a minority in comparison to usual traffic sensor readings.

To avoid these situations, in which minority class instances cannot be detected, the solution is to train the models with as many instances from the minority class as from the majority class. Over-sampling is a suitable methodology to modify the class variable distribution at the data level (pre-processing), before the learning process. In this way, the model obtains enough information from the minority class to detect these exceptions when performing in real scenarios.

In this work, starting from two different alternatives for expanding the original data, we propose three novel ways of generating new instances. Each of them is evaluated on a large, real-world dataset consisting of traffic sensor observations from different metropolitan areas of the state of California over a period of three months.

The rest of the paper is organised as follows. Section 2 reviews some of the most representative works published in the literature. Section 3 specifies the new alternatives proposed in this work. In Sect. 4 the materials and methodology applied in this work are presented. In Sect. 5 we conduct different classification experiments using data generated by all the proposed alternatives and present the results. In Sect. 6 these results are discussed and conclusions are drawn.

2 State of the Art

Different methodologies have been applied to expand original databases in order to obtain a more generalised source of knowledge, resulting in an optimised inference model. In [7] the original GAN algorithm was proposed, in which a generator and a discriminator play an adversarial game and two models are trained simultaneously: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generator. The training procedure for the generator is to maximise the probability of the discriminator making a mistake. In this way, new instances are created with characteristics similar to those of the original data.
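For reference, this adversarial game can be sketched as the following alternating optimisation. This is a generic, minimal PyTorch illustration rather than the implementation of [7] or of this work; `Generator`, `Discriminator` (ending in a sigmoid), `data_loader` and `noise_dim` are assumed to be defined by the user, and the hyperparameters are arbitrary.

```python
# Minimal GAN training loop sketch (illustrative only): the discriminator learns to
# separate real from generated samples, while the generator learns to fool it.
import torch
import torch.nn as nn

def train_gan(generator, discriminator, data_loader, noise_dim, epochs=100, lr=2e-4):
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(epochs):
        for real_batch in data_loader:
            n = real_batch.size(0)
            ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

            # Discriminator step: real samples labelled 1, generated samples labelled 0.
            fake_batch = generator(torch.randn(n, noise_dim)).detach()
            loss_d = bce(discriminator(real_batch), ones) + bce(discriminator(fake_batch), zeros)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # Generator step: try to make the discriminator output 1 on generated samples.
            fake_batch = generator(torch.randn(n, noise_dim))
            loss_g = bce(discriminator(fake_batch), ones)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return generator, discriminator
```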

Cybersecurity systems usually face the problem of data imbalance. In [9] a multi-task learning model with hybrid deep features (MEMBER) was proposed to address different challenges such as class imbalance or attack sophistication. Based on a Convolutional Neural Network (CNN) with embedded spatial and channel attention mechanisms, MEMBER introduces two auxiliary tasks (i.e., an auto-encoder (AE) enhanced with a memory module and a distance-based prototype network) to improve the model's generalisation capability and reduce the performance degradation suffered on imbalanced databases. Continuing with the intrusion detection area, a tabular data sampling method that balances normal and attack samples to solve the imbalanced learning task of intrusion detection was proposed in [6]. In [14] TGAN was presented, a method for generating tabular data with discrete and continuous variables, such as medical or educational records. In [15] CTAB-GAN was developed, a novel conditional table GAN architecture with the ability to model diverse data types, including a mix of continuous and categorical variables, solving data imbalance and long-tail issues, i.e., certain variables having drastic frequency differences across their values. In [4] a method was proposed to train generative adversarial networks on multivariate feature vectors representing multiple categorical values.

The Bayesian network-based over-sampling method (BOSME) was introduced in [12]. What makes BOSME different is that it relies on a new approach: generating artificial instances of the minority class following the probability distribution of a Bayesian network that is learned from the original minority class by likelihood maximization.
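As an illustration of this idea (not the authors' implementation), the following sketch learns a Bayesian network from the minority-class rows with the pgmpy library and samples new rows from it until both classes are balanced. It assumes a pandas DataFrame of discrete/categorical features with the class in `class_col`, and a recent pgmpy version.

```python
# Rough sketch of Bayesian-network-based over-sampling in the spirit of BOSME [12];
# assumes discrete features and the pgmpy API of recent versions (illustrative only).
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork
from pgmpy.sampling import BayesianModelSampling

def oversample_minority(df: pd.DataFrame, class_col: str) -> pd.DataFrame:
    counts = df[class_col].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    minority_df = df[df[class_col] == minority].drop(columns=[class_col])

    # Learn the network structure by score maximisation, then fit parameters by MLE.
    dag = HillClimbSearch(minority_df).estimate(scoring_method=BicScore(minority_df))
    bn = BayesianNetwork(dag.edges())
    bn.add_nodes_from(minority_df.columns)  # keep variables that ended up isolated
    bn.fit(minority_df, estimator=MaximumLikelihoodEstimator)

    # Draw as many synthetic minority rows as needed to balance the two classes.
    n_new = int(counts[majority] - counts[minority])
    synthetic = BayesianModelSampling(bn).forward_sample(size=n_new)
    synthetic[class_col] = minority
    return pd.concat([df, synthetic], ignore_index=True)
```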

Other researchers opted for treating multi-modal data in order to optimise the trained network's inference accuracy. In [13] an end-to-end framework named Event Adversarial Neural Network (EANN) was proposed, which is able to obtain event-invariant features and thus benefits the detection of fake news on newly arrived events. In [8] an audio-visual deep CNN (AVDCNN) speech enhancement model was proposed, which incorporates audio and visual streams into a unified network model. Other approaches that include data from multiple types and sources have also been used for traffic event detection. In [2] the annotation of social streams such as microblogs was treated as a sequence labelling problem, and a novel training data creation process for sequence labelling models, which utilises instance-level domain knowledge, was presented. In [3] a Restricted Switching Linear Dynamical System (RSLDS) was proposed to model normal speed and travel time dynamics and thereby characterise anomalous dynamics, using city traffic events extracted from text to explain those anomalous dynamics. In [10] human mobility and social media data were used: a detected anomaly is represented by a sub-graph of a road network where people's routing behaviours significantly differ from their original patterns, and the anomaly is then described by mining representative terms from the social media posts published when it happened. In [5] Twitter posts and sensor data observations were used to detect traffic events with semi-supervised deep learning models such as Generative Adversarial Networks, extending the multi-modal GAN model to a semi-supervised architecture to characterise traffic events.

3 Proposed Approach

As mentioned in the introduction, in classification environments in which data imbalance can cause performance deterioration of the machine learning model, it is of special interest to have a balanced class distribution. For this purpose, BOSME was proposed, tackling this issue by generating synthetic data following the probability distribution of a Bayesian network. Moreover, in the majority of cases GANs are the first option for extending databases and addressing imbalanced learning tasks. In this work, we assess both options and propose three new variants that arise from both methodologies.

3.1 Variant 1: Feeding the Discriminator of GAN with Data Proceeding from BOSME

The idea of a GAN is to maximise the capability of the generator to create instances as similar as possible to the original ones by trying to confuse the discriminator, while the latter tries to distinguish real data from synthetic data. Originally, the discriminator is fed with data produced by the generator from noise drawn from a normal distribution. If we substitute these data with data generated by BOSME, which expands the database with synthetic data from the minority class, the capability of the discriminator to distinguish real data might be enhanced. Following this idea, we propose this variant, in which BOSME is first applied to the original database and then a modified version of the GAN is applied, where the discriminator is fed with the synthetic data proceeding from BOSME.
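A minimal sketch of this modified discriminator update is given below, assuming PyTorch, that `bosme_batch` is a tensor of BOSME-generated samples with the same feature dimensionality as the real data, and that the BOSME samples take the role of the "fake" examples that previously came from the generator; the function and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

def discriminator_step_v1(discriminator, real_batch, bosme_batch, opt_d):
    """Variant 1 (sketch): the synthetic side of the discriminator update is taken
    from BOSME-generated minority-class samples instead of the generator output."""
    bce = nn.BCELoss()
    real_labels = torch.ones(real_batch.size(0), 1)
    fake_labels = torch.zeros(bosme_batch.size(0), 1)
    loss_d = bce(discriminator(real_batch), real_labels) + \
             bce(discriminator(bosme_batch), fake_labels)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_d.item()
```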

3.2 Variant 2: Feeding the Discriminator of GAN with Data Proceeding from BOSME and Data Proceeding from the Generator

As a continuation of the variant proposed above, we expand the data with which the discriminator of the GAN is fed. We mix two types of data: the data proceeding from the generator, which is produced from noise, and the synthetic data proceeding from BOSME. In this way, the discriminator obtains a more general view of the synthetic data, improving its ability to distinguish fake data from real data.

In the previous variant, the data proceeding from BOSME feeds the discriminator only with data from the minority class, which could cause problems in certain environments. In contrast, this variant tackles that issue. A simplified graphic description is given in Fig. 1.

Fig. 1. Graphic diagram of Variant 2
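In code, the only change with respect to the previous sketch is that the synthetic batch shown to the discriminator concatenates generator output and BOSME samples; again PyTorch is assumed and all names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def discriminator_step_v2(discriminator, generator, real_batch, bosme_batch,
                          noise_dim, opt_d):
    """Variant 2 (sketch): the synthetic side of the discriminator update mixes
    generator output (produced from noise) with BOSME-generated minority samples."""
    bce = nn.BCELoss()
    gen_batch = generator(torch.randn(real_batch.size(0), noise_dim)).detach()
    fake_batch = torch.cat([gen_batch, bosme_batch], dim=0)
    loss_d = bce(discriminator(real_batch), torch.ones(real_batch.size(0), 1)) + \
             bce(discriminator(fake_batch), torch.zeros(fake_batch.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_d.item()
```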

3.3 Variant 3: Application of GAN with Minority Class Data

Finally, we opted for dividing the original data according to its class. The data belonging to the minority class are used to feed the GAN, and synthetic data are created following the GAN architecture. In this way, the data imbalance issue is addressed and the resulting classification performance should be enhanced.
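A sketch of this variant, reusing the generic `train_gan` loop sketched in Sect. 2, could look as follows; `X`, `y`, the network objects and the batch size are placeholders, and appending the returned samples with the minority label is left to the caller.

```python
import torch

def augment_with_minority_gan(X, y, minority_label, generator, discriminator, noise_dim):
    """Variant 3 (sketch): train the GAN only on minority-class rows and generate
    enough synthetic minority instances to balance the two classes."""
    minority_X = torch.tensor(X[y == minority_label], dtype=torch.float32)
    loader = torch.utils.data.DataLoader(minority_X, batch_size=64, shuffle=True)
    generator, _ = train_gan(generator, discriminator, loader, noise_dim)

    n_new = int((y != minority_label).sum() - (y == minority_label).sum())
    with torch.no_grad():
        synthetic_X = generator(torch.randn(n_new, noise_dim)).numpy()
    return synthetic_X  # to be labelled as minority_label and appended to X
```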

4 Materials and Methods

4.1 Material and Environment

In this work we tested each of the variants proposed in the previous section, as well as the original GAN and BOSME methodologies, by expanding original data from a large, real-world dataset consisting of traffic sensor observations from different metropolitan areas of the state of California over a period of three months.

The Caltrans Performance Measurement System (PeMS) [1] provides a large amount of traffic sensor data that has been widely used by the research community. We collected traffic events over a three-month period, from 31st July 2013 to 31st October 2013, for three different metropolitan areas of the state of California, i.e., Bay Area, North Central and Central Coast. We divided the traffic events according to their level of risk, i.e., hazard and control. In each case we identified the minority class in order to proceed with each variant proposed in this work.

The environment in which all training and testing procedures took place is the following. We used the machine-learning-oriented sklearn [11] library of Python on a 64-bit Windows operating system running on an Intel Core i5-2010U CPU at 1.6 GHz \(\times \) 4.

4.2 Methodology

First of all, we applied each of the three variants proposed in Sect. 3, as well as the original BOSME and GAN methodologies. In this way, we had 5 ways of generating synthetic data starting from the original databases. Next, each of the expanded databases was used to feed 7 well-known ML classifiers, listed below. Each of them was used in the default configuration of sklearn except for the attributes mentioned; a set-up sketch is given after the list.

  • DT: Decision Tree (criterion = entropy)

  • RF: Random Forest (number of estimators = 150, criterion = entropy)

  • knn: k-Nearest Neighbors (number of neighbors = 3, weights = distance)

  • GNB: Gaussian Naive Bayes

  • AB: Adaboost (Base Classifier: Decision Tree)

  • MLP: Multilayer Perceptron

  • SVM: Support Vector Machine
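A sketch of this classifier set-up in sklearn could be the following; it keeps the defaults except for the listed hyperparameters (the Adaboost base learner is the sklearn default, which is already a decision tree), and `probability=True` is added to the SVM only so that class probabilities are available for the AUC computed later.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# The 7 classifiers of Sect. 4.2, using sklearn defaults except where noted.
classifiers = {
    "DT": DecisionTreeClassifier(criterion="entropy"),
    "RF": RandomForestClassifier(n_estimators=150, criterion="entropy"),
    "knn": KNeighborsClassifier(n_neighbors=3, weights="distance"),
    "GNB": GaussianNB(),
    "AB": AdaBoostClassifier(),     # default base learner is a decision tree
    "MLP": MLPClassifier(),
    "SVM": SVC(probability=True),   # probabilities needed for the ROC/AUC
}
```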

Different performance metrics were used to determine which of the aforementioned techniques for extending the original data best fits the traffic event prediction task. These are accuracy, recall, precision, F1 score and AUC (Area Under the ROC Curve). The AUC is the area below the ROC curve, i.e., a graph showing the performance of a classification model at all classification thresholds, which plots the FPR and the TPR on the x and y axes, respectively; their definitions are given in Eqs. 4 and 5. The definitions of the remaining metrics are given in Eqs. 1 to 3 and 6, where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively.

$$\begin{aligned} \text {Accuracy (Acc)} = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(1)
$$\begin{aligned} \text {Recall (Re)} = \frac{TP}{TP+FN} \end{aligned}$$
(2)
$$\begin{aligned} \text {Precision (Pr)} = \frac{TP}{TP+FP} \end{aligned}$$
(3)
$$\begin{aligned} \text {True Positive Rate (TPR)} = \frac{TP}{TP+FN} \end{aligned}$$
(4)
$$\begin{aligned} \text {False Positive Rate (FPR)} = \frac{FP}{FP+TN} \end{aligned}$$
(5)
$$\begin{aligned} \text {F1-score (F1)} = \frac{2 \cdot Pr \cdot Re}{Pr+Re} \end{aligned}$$
(6)
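In practice, all of these metrics can be obtained directly from sklearn; the following sketch assumes binary labels in `y_true`, hard predictions in `y_pred` and positive-class probabilities in `y_score` (placeholder names chosen for illustration).

```python
# Sketch of how the metrics defined in Eqs. 1-6 can be computed with scikit-learn.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    return {
        "Acc": accuracy_score(y_true, y_pred),   # Eq. 1
        "Re": recall_score(y_true, y_pred),      # Eq. 2 (equals the TPR of Eq. 4)
        "Pr": precision_score(y_true, y_pred),   # Eq. 3
        "F1": f1_score(y_true, y_pred),          # Eq. 6
        "AUC": roc_auc_score(y_true, y_score),   # area under the ROC curve
    }
```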

5 Experiments and Results

Each of the methodologies cited in this work was tested for data augmentation of the original traffic event databases. For 10 different seeds, a 10-fold cross-validation was performed in each case to obtain all performance metrics. In each table, the first column gives the metropolitan area of event detection and the data augmentation methodology applied, where BA, NC and CC stand for Bay Area, North Central and Central Coast, respectively. Vx stands for the x-th variant proposed in Sect. 3, and Original means that the evaluation was done using the original database. The abbreviation of each classifier is given in Sect. 4.
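A sketch of this evaluation protocol could look as follows, where `X` and `y` denote an (already augmented) dataset as arrays, `evaluate` is the metric helper sketched above, and the use of stratified folds is an assumption on our part.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv(clf, X, y, n_seeds=10, n_folds=10):
    """Average the metrics of one classifier over 10 seeds x 10-fold cross-validation."""
    scores = []
    for seed in range(n_seeds):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(X, y):
            clf.fit(X[train_idx], y[train_idx])
            y_pred = clf.predict(X[test_idx])
            y_score = clf.predict_proba(X[test_idx])[:, 1]
            scores.append(evaluate(y[test_idx], y_pred, y_score))
    return {metric: np.mean([s[metric] for s in scores]) for metric in scores[0]}
```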

Table 1. Accuracies of different classifiers for different data augmentation techniques for different areas.

As can be observed in Table 1, in the majority of cases the independent application of GAN or BOSME outperforms the variants we propose in terms of accuracy. However, this metric is not very representative, because even the original database yields competitive accuracy. In fact, due to the class imbalance in these databases, the accuracy does not degrade: the few minority class instances can be wrongly classified while still offering a good overall accuracy. Other metrics are needed to obtain more general conclusions, so we also examined Precision, Recall, F1-score and AUC. As the most important aspect is the correct classification of instances from the minority class, Recall is the most representative metric, since it determines how good a classifier is at predicting a positive instance as positive, i.e., it defines the ratio between the instances classified as positive and all positive instances. As shown in Table 2, in the Original database cases this metric drops significantly. Thus, more instances from the minority class are necessary for the proper training of each of the classifiers.

Table 2. Recall of different classifiers for different data augmentation techniques for different areas.

Another interesting metric to observe is the AUC. One way of interpreting the AUC is as the probability that the model ranks a random positive example more highly than a random negative example. In this case, the original database offers the worst performance, since the generalisation caused by the data imbalance makes it harder to rank positive instances above negative ones. Considering the rest of the methods mentioned and proposed within this work, we cannot deduce which is the best, given that in each area some variants perform better than others for some classifiers and vice versa for other classifiers. For instance, Variant 2 suits the Random Forest classifier best, whereas for the knn classifier it is the worst option. The independent application of GAN and BOSME offers a more regular performance across the different classifiers. However, the highest value is obtained by the combination V3-AB in Bay Area, V1-AB in North Central and V2-AB in Central Coast, which means that our variants are the most adequate when paired with the best classifier. Table 3 shows all these measurements of the aforementioned metric.

Table 3. AUC of different classifiers for different data augmentation techniques for different areas.

6 Discussion and Conclusion

In this work we have highlighted the importance of having balanced data in classification tasks in order to avoid over-generalisation of the trained classifiers. In different environments, an incorrect classification of an instance belonging to the minority class could have a critical impact. Thus, a data pre-processing step is needed to extend the minority class instances and address this issue.

First, we looked at the accuracies of the different classifiers after applying each data augmentation methodology described in the previous sections. We saw that there was no evident difference between applying the different methods for balancing the data and training the classifiers on the original database. This is caused by the low number of instances belonging to the minority class: their incorrect classification does not degrade the accuracy severely.

However, if we look at other metrics such as Recall or AUC, we can appreciate the importance of these data augmentation techniques. With them, the classifiers have enough instances from both classes for the training phase, and the generalisation problem is tackled. In each metropolitan area analysed in the experimental process, the original data augmentation techniques perform more regularly than the variants proposed in this work. In fact, if we judge them by their overall performance across all the classifiers, we can deduce that they outperform the variants we proposed. Nevertheless, the highest AUC values were obtained by one of the variants proposed in Sect. 3 for each metropolitan area.

For Bay Area, the highest AUC value was obtained after applying our third variant, i.e., the application of the GAN to the minority class instances, with the posterior use of Adaboost (Decision Tree as base classifier) as the classifier.

For the North Central area, our first approach gives the highest AUC value, i.e., the use of the new instances proceeding from BOSME as the input to the discriminator, with the GAN then used for the creation of new instances. Finally, Adaboost (Decision Tree as base classifier) was used as the classifier.

In the case of Central Coast, the second approach gives the best AUC value, i.e., the use of the new instances proceeding from BOSME together with the instances proceeding from the generator as the input to the discriminator, with the GAN then used for the creation of new instances. Finally, Adaboost (Decision Tree as base classifier) was used as the classifier.

To summarize, the power of data augmentation techniques as a pre-processing tool in data imbalance environments has been clearly demonstrated in this work. The adequateness of each proposed variant depends on the characteristics and distribution of the original database, and on the machine learning model subsequently adopted for the classification task. For more complex models such as Adaboost or Random Forest, where more than a single classifier is evaluated, our variants outperform the original GAN and BOSME. Although the highest AUC values are obtained by one of these variants in each metropolitan area, the overall performance with simpler models is better for the simplest data augmentation methodologies. Depending on the application or the hardware limitations of the system to be deployed, some options will be more adequate than others. For instance, if the model size must be kept small or the training time is critical, lighter models should be used and the original GAN and BOSME would be the options to adopt. In contrast, if there are no such restrictions, searching for the best combination of classifier and variant for addressing the data imbalance issue would be the best alternative. Following this line, an automatic way of finding the best combination of data augmentation technique and classification model would alleviate a large part of this search, improving the system's time efficiency.