1 Introduction

Numerous applications require filtering or detecting abnormal observations in data. For instance, in security, intruders are abnormalities (Ribeiro et al. 2016; Pimentel et al. 2014; Luca et al. 2016; Phua et al. 2010; Yeung and Ding 2001); in traffic data, road accidents (Theofilatos et al. 2016); in geology, the eruption of volcanoes (Dzierma and Wehrmann 2010); in food control, foreign objects inside food wrappers (Einarsdóttir et al. 2016); in economics, the bankruptcy of a company (Fan et al. 2017); and in neuroscience, a previously unexperienced stimulus is considered an abnormality (Kafkas and Montaldi 2018). Depending on the context, the abnormalities are called rare events, anomalies, novelties, outliers, exceptions, aberrations, surprises, peculiarities, noise or contaminants, among others. Of these, the most common terms in the literature are rare event, anomaly, novelty and outlier.

Considering the importance of abnormalities in different areas, a large amount of research has been carried out, mainly in the last 10 years. However, because these contributions come from different knowledge areas, a mix-up between names and problems has arisen in the literature, particularly when the same term is used in distinct disciplines with different meanings, and vice versa. Moreover, the terminology has changed over time, and even within the same discipline a similar problem has been named differently in different time periods. On the one hand, different names have been used for similar problems. For instance, Van Den Eeckhaut et al. (2006) deal with the problem of predicting, within a fixed period of time, the risk of a landslide in an area. The authors create a landslide susceptibility map in which each area is scored based on the risk of a landslide. This is done using historical data of both normal ground and ground that has suffered a landslide (abnormal). In this study, the authors refer to landslides as rare events because landslides seldom occur. In Ribeiro et al. (2016) a similar problem is addressed, but with a different term. Here, a study in the railway industry is carried out. Train passenger doors have several subsystems that keep them open or closed according to a variety of safety and comfort rules. In some situations these doors fail due to the deterioration of the system. Therefore, the authors predict whether or not the door is going to fail within a fixed period of time. In order to do that, both normal and failure historical data is used to learn a model. In this case, the door failures are referred to as anomalies. As can be seen, both problems are very similar, yet different terms have been used to refer to the abnormalities. In both problems, temporal data of normal and abnormal classes is available to build the prediction models.

On the other hand, the same term has been used to describe widely different problems. In the following two problems, the authors use the term novelty to describe the abnormalities. In Luca et al. (2016), a variety of patients are constantly monitored with a 3D accelerometer. Those patients occasionally suffer an epileptic seizure. Due to the abrupt movement during a seizure, the patient could become injured. Therefore, detecting this behavior as soon as possible is relevant in order to avoid this harmful situation. In order to predict whether a patient is suffering an epileptic seizure, a model is built based on the recorded movement data of several patients. The data consists of 3D accelerometer readings divided into fixed time windows, each annotated with whether or not an epileptic seizure has occurred. However, notably less abnormal (seizure) data is available due to the rarity of these attacks. In the prediction phase, given new information about a currently monitored patient, the classifier detects whether the patient is suffering an attack at that moment. Einarsdóttir et al. (2016) detect foreign objects inside food envelopes. A classifier is learned only from food images without abnormal objects. In other words, the model is learned using information from only one class. However, in the detection phase, the model classifies new instances into two classes, normal (without foreign objects) and abnormal (with foreign objects). While both examples use the same term, the problems are widely different. For instance, the former has both normal and abnormal data available to train the model, whereas the latter only learns from a dataset with observations of a single class. A summary of the aforementioned examples is presented in Table 1.

As we have seen in the previous paragraphs, there is an important mix-up between terms and problems. Possibly motivated by this same mix-up, some papers that present specific learning methods have made an effort in their introduction to discuss the differences between one or two terms, or to clearly define their learning scenario. However, to the best of our knowledge, no paper in the literature has treated the four terms rare event, anomaly, novelty and outlier from the supervised classification point of view. For instance, in Luca et al. (2016) and Dufrenois and Noyer (2016), a brief discussion about the novelty term and the one-class classification framework is given. In Weiss and Hirsh (1998), the authors clearly define their rare event learning scenario. In Campos et al. (2016), an effort is made to distinguish between one-class classification and outlier detection. Finally, in Ribeiro et al. (2016), three methods related to the outlier, anomaly and novelty detection learning scenarios are used to solve the same problem, and some insights are given about all three learning scenarios. However, none of these papers frame the corresponding terms within the supervised classification framework.

This confusion leads to the repetition of research and hinders the advance of the field. Therefore, the aim of this paper is to contribute a first step towards the organization of the area. To that end, this work underlines the differences between the terms and organizes the area by looking at all of them under the umbrella of supervised classification. In particular, each term is associated with its most frequently used learning scenario.

Table 1 An illustrative example of the mix-up between terms and problems in the literature

This paper is organized as follows. Each section describes a supervised learning scenario: Sect. 2 describes rare event detection; Sect. 3, anomaly detection; and Sect. 4, novelty detection. In each section, the objective of the classification task, the characteristics of the input data and the most popular techniques for the described learning scenario are reviewed. In Sect. 5, the related outlier term is treated. In Sect. 6, the one-to-one assignment of terms to learning scenarios is described, coupled with a brief discussion of the main evaluation techniques of each learning scenario. In Sect. 7, the experimental validation is described. Finally, in Sect. 8, the conclusions of this work are presented.

2 Rare event detection

Almost all the papers that use the term rare event to describe the abnormalities of the problem to be solved share the time dimension as a common characteristic (Theofilatos et al. 2016; Heard et al. 2010; Dzierma and Wehrmann 2010). For instance, in Theofilatos et al. (2016), a road accident study in the Attica Tollway (Greece) is performed. The authors divide the tollway into different sections and detect the occurrence of an accident in a given section of the highway. A model is built based on recorded data from ground sensors and traffic cameras. More specifically, the data is sliced into one-hour time intervals and manually labeled by experts. Therefore, given a new one-hour time interval, the model detects an accident occurrence. In Dzierma and Wehrmann (2010), a geomorphological study is performed. The authors predict whether a new volcano eruption is going to happen within a fixed period of time. A Poisson process is learned from the historical Volcanic Explosivity Index (VEI) of two volcanoes. Next, given a new VEI of one of the two volcanoes, the occurrence of an eruption within a fixed time interval is predicted.

In the previously described problems, the goal consists of predicting the occurrence of a rare event within a bounded period of time. A genuine characteristic of the rare event learning scenario, from a supervised classification point of view, is that the instances are time series (Hamilton 1994). From this perspective, the objective is to classify new incoming time series as rare (the rare event has occurred) or normal (no event has occurred) using a previously learned model. This approach is known in machine learning as supervised time series classification (Esling and Agon 2012). However, due to the temporal nature of the problem, two different classification approaches can be found in the literature. The first is full-length supervised time series classification. For example, in Murray et al. (2005), the SMART dataset is used to detect whether a hard drive will be faulty within a fixed period of time. The authors learn a model using hard-drive sensor measurements recorded at different times. Then, given new hard-drive sensor data, failure is detected. In Zhang et al. (2017), a thermo-technology dataset containing information gathered over time about heating systems is used. The objective is to detect whether the heating system has failed within a fixed period of time. The second approach aims to classify new observations (time series) as early as possible, preferably before the full time series is available. This is known as early supervised time series classification in the machine learning literature (Mori 2015). For example, in Ogbechie et al. (2017), the prediction of faulty metal bars is studied. During the bar melting process, several sensors monitor the characteristics of each bar. These measurements, recovered from both normal and faulty bars, are used to learn a model. Next, given information about a new bar, the classifier predicts whether the bar is going to be faulty. The early detection of a faulty bar is crucial because, depending on when it is detected, the bar can be fixed during the rest of the process.

Regarding the characteristics of the data, in most of the problems referred to with the rare event term, instances are time series labeled in two categories: normal (\(\mathcal {N}\)) and rare (\(\mathcal {R}\)). Furthermore, in many papers, the data shows an unbalanced distribution of classes. Formally, assuming that the data is generated by a generative mechanism \(P(\mathbf {x},c)\) (Mitchell 1997), \(P(C=\mathcal {R}) \ll P(C=\mathcal {N})\). Regarding the instances available during the training stage, both normal and abnormal instances can be used to learn the classifier. Therefore, rare event classification can be formalized as a (highly) unbalanced supervised time series classification problem (Köknar-Tezel and Latecki 2011; Cao et al. 2011). Formally, this scenario can be described as follows:

A time series (TS) is an ordered sequence of (timestamp, value) pairs of fixed length m:

$$\begin{aligned} TS&=\{(t_1,x_1),\ldots ,(t_i,x_i),\ldots ,(t_m,x_m)\} \nonumber \\&\text {with } t_i \in \mathbb {N} \text {, for } i=1,\ldots ,m \end{aligned}$$
(1)

Time series classification is a supervised data mining task in which, given a training set of time series, \(\mathbf {TR}=\{(\mathbf {TS}_1, y_1), \ldots , (\mathbf {TS}_n, y_n)\}\), where y represents the label of the corresponding time series, the objective is to build a classifier that is able to predict the class label of any new time series as accurately as possible (Mori 2015). In the particular case of a rare event, it is common to have a scenario where \(P( C = \mathcal {R}) \ll P( C = \mathcal {N})\). A common classification process can be seen in Fig. 1.
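To make this formalization concrete, the following minimal sketch (in Python, using scikit-learn) trains a classifier on a synthetic set of labeled fixed-length series with \(P(C=\mathcal {R}) \ll P(C=\mathcal {N})\). The data, class ratio and choice of classifier are illustrative assumptions rather than the setup of any cited work; note that treating each timestamp as an independent feature ignores the ordering that dedicated time series classifiers exploit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
m, n_normal, n_rare = 50, 950, 50          # series length m; P(R) << P(N)

# Synthetic training set TR = {(TS_1, y_1), ..., (TS_n, y_n)}: normal series
# are pure noise, rare series carry a small upward drift.
X_normal = rng.normal(size=(n_normal, m))
X_rare = rng.normal(size=(n_rare, m)) + np.linspace(0, 2, m)
X = np.vstack([X_normal, X_rare])
y = np.array(["N"] * n_normal + ["R"] * n_rare)

# Each timestamp is treated here as one feature; dedicated time series
# classifiers (DTW-based k-NN, shapelets, ...) would exploit the ordering.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)

new_series = rng.normal(size=(1, m)) + np.linspace(0, 2, m)
print(clf.predict(new_series))             # likely ['R']
```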

Fig. 1 A flowchart of the supervised time series classification data mining task (Mori 2015)

In addition, some problems in the literature require the prediction to be output as soon as possible. This learning scenario is known as early time series classification (Mori et al. 2018).

However, even though the problem itself has the time dimension as a key component, in some rare event detection applications the instances are transformed without considering this genuine characteristic. In that case, the approach treats the problem as an unbalanced non-temporal classification task, similar to those found in the anomaly detection learning scenario (further described in Sect. 3). For instance, in Murray et al. (2005), the data is composed of several hard-drive sensor measurements taken at different time intervals, so, for the same drive, many readings of the same sensors are available. However, the authors do not consider the order in which the measurements were recorded, and, given new unordered hard-drive measurements, the model classifies the drive as faulty or normal. Hence, the temporal nature of the data is not leveraged.

Regarding the rare event literature, most of the related papers focus on correctly classifying the rare class. Therefore, in order to evaluate the performance of the classification task, metrics such as the AUC (Zhang et al. 2017; Xu et al. 2016; Ren et al. 2016) and the recall of the rare class (Zhang et al. 2017; Ren et al. 2016) are commonly used.
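As an illustration, both measures can be computed as follows; the labels and scores are hypothetical, and scikit-learn is assumed as the evaluation library.

```python
from sklearn.metrics import recall_score, roc_auc_score

# Hypothetical ground truth, predictions and rare-class scores for ten series.
y_true = ["N", "N", "N", "N", "N", "N", "N", "N", "R", "R"]
y_pred = ["N", "N", "N", "N", "N", "R", "N", "N", "R", "N"]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.7, 0.1, 0.4, 0.9, 0.4]  # estimated P(C=R|x)

# Recall of the rare class: fraction of true rare events that were detected.
print(recall_score(y_true, y_pred, pos_label="R"))   # 0.5 here
# AUC: ranking quality of the rare-class scores, insensitive to the class ratio.
print(roc_auc_score([c == "R" for c in y_true], scores))
```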

Among the most frequently used techniques in this learning scenario, rare event logistic regression, an adaptation of logistic regression to rare classes, is a popular choice (King et al. 2001; Theofilatos et al. 2016; Ren et al. 2016; Van Den Eeckhaut et al. 2006). However, techniques such as the Kullback-Leibler divergence to discriminate between rare and normal events (Xu et al. 2016), long short-term memory neural networks (Zhang et al. 2017), rule-based classifiers learned with genetic algorithms (Weiss and Hirsh 1998), multiple-instance naïve Bayes (Murray et al. 2005), Poisson processes (Dzierma and Wehrmann 2010), support vector data regression with surrogate functions (Bourinet 2016), Bayesian networks (Cheon et al. 2009) and support vector machines (Khreich et al. 2017) have also been successfully adapted to this learning scenario.

Taking into account the unbalanced distribution of classes, most of the previous methods are coupled with techniques specifically designed to deal with unbalanced time series classification. These include the Structure Preserving Oversampling (SPO) technique (Cao et al. 2011) and an adaptation of the classical Synthetic Minority Over-sampling TEchnique (SMOTE) (Köknar-Tezel and Latecki 2011).

Finally, another widely different rare-event-related learning scenario can be found in the literature: the estimation of the probability of occurrence of a rare event (Wu et al. 2003; Cadini et al. 2017; Dessai and Hulme 2004; Dueñas-Osorio and Vemuru 2009; Bedford and Cooke 2001). This approach is mainly used in engineering and physics, and some illustrative examples include the estimation of the probability of infrastructure failure in a fixed period of time (Dueñas-Osorio and Vemuru 2009), the estimation of the probability of failure of technical systems in a fixed period of time (Bedford and Cooke 2001), and the estimation of the probability of extreme climate developments in a specific time window (Dessai and Hulme 2004). Since this learning scenario is beyond the supervised classification framework, it is not considered further in this paper. Among the techniques most frequently used to estimate the rare event probability, importance sampling and Monte Carlo simulation (Balesdent et al. 2016; Auffray et al. 2014), kriging (Auffray et al. 2014) and the first order reliability method (FORM) (Straub et al. 2016) are found in the literature. A brief sketch of the Monte Carlo approach is given below.
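Although out of scope for this paper, the following hedged sketch illustrates the flavor of these estimation techniques: crude Monte Carlo and importance sampling for a hypothetical limit-state function g, where failure corresponds to g(x) < 0. The function, distributions and sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def g(x):
    # Hypothetical limit-state function: the system fails when g(x) < 0,
    # i.e. when the standard-normal input exceeds 4.
    return 4.0 - x

# Crude Monte Carlo: very many samples are needed because failures are rare.
samples = rng.normal(size=1_000_000)
p_mc = np.mean(g(samples) < 0)
print(p_mc)                            # around 3e-5, with high relative variance

# Importance sampling: draw from a proposal shifted towards the failure
# region and correct each sample with the likelihood ratio p(x)/q(x).
proposal = rng.normal(loc=4.0, size=10_000)
weights = norm.pdf(proposal) / norm.pdf(proposal, loc=4.0)
p_is = np.mean((g(proposal) < 0) * weights)
print(p_is)                            # similar estimate from far fewer samples
```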

3 Anomaly detection

Most of the problems that describe the abnormalities with the anomaly term are non-temporal. The data is labeled in two categories: normal (\(\mathcal {N}\)) and anomaly (\(\mathcal {A}\)). For instance, in Miri Rostami and Ahmadzadeh (2018), the authors detect breast cancer using the Surveillance Epidemiology and End Results (SEER) dataset. This dataset consists of patients who have been examined for cancer diseases. The patients who suffer from cancer are described with the anomaly term. Hence, the data consists of both normal and anomalous instances. A model is then learned which classifies new unseen cases as anomalous or normal. Fiore et al. (2017) detect fraudulent credit-card transactions using a public dataset with legitimate transactions and notably fewer fraudulent ones. A neural network is learned to classify new incoming transactions as legitimate or fraudulent.

In the anomaly detection learning scenario, anomalous instances are scarce due to the unbalanced distribution between the normal and anomaly classes (Chandola et al. 2009). Therefore, this scenario can be formalized as (highly) unbalanced supervised classification. Formally, an instance is defined as \(\mathbf {x}=(x_1,\ldots ,x_m)\). Given a training set \(\mathbf {TR}=\{ (\mathbf {x}_1, y_1), \ldots , (\mathbf {x}_n, y_n)\}\), in which y represents the label of the corresponding instance, the objective is to learn a classifier that is able to predict the class label of any new instance as accurately as possible. Regarding the probability distribution of the class variable, \(P(\mathcal {A}) \ll P(\mathcal {N})\), where \(\mathcal {A}\) represents the anomaly class label and \(\mathcal {N}\) the normal class label. An illustrative example can be seen in Fig. 2.

Fig. 2 A flowchart of a supervised classification task. This learning scenario is assigned to the anomaly detection term

In order to evaluate the performance of the classifiers, due to the (highly) unbalanced distribution of classes, common metrics such as accuracy are not informative enough. Therefore, authors focus on the correct classification of the abnormalities. A popular choice is to maximize the recall of the minority class (Ribeiro et al. 2016; Miri Rostami and Ahmadzadeh 2018).

For anomaly detection, popular supervised classifiers have been adapted, obtaining competitive results. For instance, support vector machines (Zhou et al. 2017), neural networks (Noto et al. 2012) and Gaussian mixture models (Reynolds 2015) have dedicated algorithms for anomaly detection domains. Note that, since anomaly detection can be formalized as a (highly) unbalanced supervised classification problem, techniques that specifically deal with unbalanced domains can be used. Similar to the oversampling techniques used for rare events, SMOTE (Miri Rostami and Ahmadzadeh 2018; Araujo et al. 2018) is widely used to synthetically generate instances of the minority class.
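A minimal sketch of this pipeline, assuming the imbalanced-learn implementation of SMOTE and an illustrative synthetic dataset, could look as follows.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Roughly 1% anomalies (class 1) versus 99% normal instances: P(A) << P(N).
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=0)
print(Counter(y))                       # strongly unbalanced class counts

# SMOTE synthesizes new minority instances by interpolating between a
# minority instance and one of its nearest minority neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                   # balanced counts after resampling

clf = SVC().fit(X_res, y_res)           # any standard classifier can follow
```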

4 Novelty detection

In most of the papers that use the term novelty to describe the abnormalities, the model is learned using a dataset that contains only one class. For instance, in Khreich et al. (2017), system call traces are classified as novel or normal. A novel instance corresponds to an unsupported or unexpected system call trace. To learn the model, only normal system call traces, gathered in a secure environment, are used. When a new system call trace arrives, the classifier predicts it as normal or novel. Similarly, in Einarsdóttir et al. (2016), a study in food control is carried out. Specifically, in some cases, foreign objects can be found inside food envelopes. Since this situation can result in a bad customer experience and legal issues, the detection of foreign objects is crucial. The authors learn a classifier using X-ray images only of normal food (without foreign objects inside). Next, given a new unseen X-ray image, the classifier predicts it as novel (with a foreign object inside) or normal. The novelty term has also been commonly used in streaming scenarios. Masud et al. (2013) start from a labeled dataset, from which an initial model is learned. This model classifies each new incoming instance either among the normal known classes or as novel (the instance is not similar to any known class). If the new instance is classified as novel, it is kept in a buffer because it is considered a candidate for a new class. When this buffer is full, new classes are sought in it, and the classifier is updated with the newly emerged classes for future predictions.

Regarding the two aforementioned problems, two different learning scenarios can be distinguished. The first is what we call the static novelty detection learning scenario. Here, the problem can be cast as a binary supervised classification problem. Given a dataset composed of only one class, a model is built. This model learns a decision boundary that isolates the normal behavior. For prediction, when a new instance arrives, it is classified as novel or normal. In this framework, the efforts are focused on correctly classifying the normal class (Pimentel et al. 2014; Einarsdóttir et al. 2016; Kafkas and Montaldi 2018). Therefore, in order to evaluate the performance of the classifiers, the recall of the normal class is commonly maximized (Swarnkar and Hubballi 2016; Luca et al. 2016). Formally, the training set is generated only from \(P(\mathbf {x} | C=\mathcal {N})\). At the training stage, even though the classifier is learned using information about only one class (the normal class), it is built considering that another behavior exists which is different from the normal one.

Formally, an instance is defined as \(\mathbf {x} = (x_1, \ldots , x_m)\). Given a training dataset \(\mathbf {TR}=\{(\mathbf {x}_1, y_1=\mathcal {N}), \ldots , (\mathbf {x}_n, y_n=\mathcal {N}) \}\), the objective is to learn a classifier that is able to discriminate between normal \(\mathcal {N}\) and novel instances. Note that, in this learning scenario, only one class, the normal class \(\mathcal {N}\), is available to train the model. An illustrative example is shown in Fig. 3.
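As an illustration of this learning scenario, the following sketch fits a one-class SVM (one of the techniques reviewed later in this section, here in its scikit-learn implementation) on normal instances only; the data and the \(\nu\) parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))      # TR contains only the normal class N

# nu roughly upper-bounds the fraction of training instances treated as
# boundary violations; 0.05 is an illustrative choice.
clf = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)

X_new = np.array([[0.1, -0.2],            # resembles the normal behavior
                  [6.0, 6.0]])            # far from anything seen in training
print(clf.predict(X_new))                 # +1 = normal, -1 = novel
```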

Fig. 3 A flowchart of the supervised classification framework in which only one class is available to learn the classification model. This learning scenario is assigned to the static novelty detection term

The second scenario is what we call dynamic novelty detection. In the literature, it is also known as evolving classes, future classes or novel class detection (Masud et al. 2013; Mu et al. 2018; Faria et al. 2016). This learning scenario can be formalized as a supervised classification problem in which the number of labels of the class variable is unknown. In other words, the generative probability distribution changes dynamically during the classification process. Therefore, the classifier has to adapt to these changes. When a new instance arrives, the model either classifies it among the current classes or stores it in a buffer (Masud et al. 2013; Spinosa et al. 2007; Zhu et al. 2018). Considering the life-cycle of the classes, these can drift, be born, die or reappear. Hence, the classifier must be updated to reflect those changes, considering that the adaptation time is relevant in a streaming environment. Note that most of the existing approaches consider a dynamically (highly) unbalanced supervised classification scenario (Masud et al. 2013; Spinosa et al. 2007; Chen et al. 2008; Zhu et al. 2018), since a few instances may constitute a new emerging class (Fig. 4). To evaluate the performance of the classifier in this environment, genuine metrics have been proposed. For instance, Masud et al. (2013) use the percentage of novel class instances classified as a current class, the percentage of existing class instances falsely identified as novel, and the total misclassification error. Zhu et al. (2018) use the average precision among all classes. Chen et al. (2008) report the evolution of the classification error as new events occur: the emergence, disappearance or drift of a class.

The dynamic novelty detection learning scenario can be divided into two stages. Firstly, the initial learning stage (also known as the offline stage), in which a model is built from a labeled training dataset. Secondly, the prediction stage (also known as the online stage), in which new classes may emerge and disappear, and old classes may drift. These two phases are formalized as follows:

Initial training phase (offline) In the offline phase a classifier \(C_0\) is learned considering a set of labels \(L_0\).

Prediction phase (online) The online phase can be described as a prediction and adaptation stage in which a data stream (DS), a possibly infinite sequence of instances, is observed. At time t, the current classifier \(C_t\) predicts the class of a new instance. If the instance is classified as one of the current labels, the classifier is adapted with this knowledge to create \(C_{t+1}\). If the new instance cannot be classified into the current set of labels, it is kept apart in a buffer and the model is not modified. Once the buffer is full, the classifier is updated and the set of labels \(L_t\) is modified.

An illustrative flowchart of this learning scenario can be seen in Fig. 5.
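The following simplified sketch illustrates these two phases. The rejection rule (distance to the nearest class centroid), the buffer size and the way new classes are declared are illustrative simplifications, not the exact procedure of Masud et al. (2013) or any other cited method; real approaches additionally cluster the buffer and check its cohesion before declaring a class.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

class DynamicNoveltyDetector:
    """Offline phase: learn C_0 from labeled data; online phase: predict,
    buffer rejected instances, and discover new classes when the buffer fills."""

    def __init__(self, X0, y0, reject_radius=3.0, buffer_size=30):
        self.model = NearestCentroid().fit(X0, y0)   # offline: C_0 with labels L_0
        self.reject_radius = reject_radius
        self.buffer, self.buffer_size = [], buffer_size
        self.next_label = int(max(y0)) + 1

    def process(self, x):
        # Online: classify x with C_t if it is close enough to a known class,
        # otherwise keep it apart as a candidate member of an emerging class.
        dists = np.linalg.norm(self.model.centroids_ - x, axis=1)
        if dists.min() <= self.reject_radius:
            return int(self.model.predict(x.reshape(1, -1))[0])
        self.buffer.append(x)
        if len(self.buffer) == self.buffer_size:
            self._discover()                         # update C_t and L_t
        return -1                                    # -1 = buffered as novel

    def _discover(self):
        # Declare the buffered instances a new class. A real method would
        # first cluster the buffer and check its cohesion and separation.
        X = np.vstack([self.model.centroids_, np.array(self.buffer)])
        y = np.concatenate([self.model.classes_,
                            [self.next_label] * self.buffer_size])
        self.model = NearestCentroid().fit(X, y)
        self.next_label += 1
        self.buffer = []

rng = np.random.default_rng(0)
det = DynamicNoveltyDetector(rng.normal(size=(100, 2)),
                             np.zeros(100, dtype=int))   # one known class: 0
print(det.process(np.array([0.2, -0.1])))                # 0: a known class
print(det.process(np.array([9.0, 9.0])))                 # -1: buffered as novel
```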

Fig. 4 Flowchart of the dynamic novelty detection learning scenario. At the beginning, the given classes are modeled. When new instances arrive, they are classified among known classes or they are rejected as not belonging to any existing class (see the crossed instances in b). Finally, the new emerging class is sought. a Initial training example. b Classification of instances. c Class discovery

Fig. 5 A flowchart of the dynamic novelty detection problem. In this problem, the number of labels for the class variable is unknown, and dynamically changes over time

Regarding the techniques used in static novelty detection, one-class classification techniques are the most representative in this learning scenario. For instance, one-class SVMs (Dufrenois and Noyer 2016; Erfani et al. 2016; Khreich et al. 2017), k-nearest neighbors data description (Tax 2001), graph-embedded one-class classifiers (Mygdalis et al. 2016), one-class random forests (Désir et al. 2013) and Isolation Forest (Zhang et al. 2011) have been successfully applied in the static novelty detection learning scenario. In dynamic novelty detection, several techniques have been proposed in the literature: OLINDDA (Spinosa et al. 2007), a sphere-based novelty detection algorithm in which clustering is carried out with the k-means algorithm; MuENLForest (Zhu et al. 2018), which discovers new labels in a multi-label classification framework by creating an ensemble of Random Forest and Isolation Forest classifiers; and the ensemble proposed in Masud et al. (2013), which builds an ensemble of decision trees that, in each leaf node, runs a k-means algorithm to discover sphere-shaped emerging classes.

5 The related outlier detection scenario

The outlier term also comes up when searching for works related to the rare event, anomaly and novelty terms. While the term is mainly associated with an unsupervised framework, the literature shows examples where it is used to name the previously explained scenarios (Hodge and Austin 2004; Zhang and Zulkernine 2006; Billor et al. 2000). Therefore, it is briefly considered in this section.

In some papers, the term outlier has been related to noise, linking these observations with incorrect or inconsistent behaviors (Aggarwal 2017). Consequently, the outlier detection task forms part of a preprocessing phase (Teng et al. 1990; Rousseeuw et al. 2011). For instance, when human errors are introduced while recording data, the resulting erratic observations are considered outliers (Barai and Lopamudra 2017). In other situations, instances with a high deviation from the rest are considered outliers (Radovanović et al. 2015; Dang et al. 2014). In Radovanović et al. (2015), the authors detect all-star players in an unlabeled dataset composed of NBA players between 1973 and 2003. The outstanding NBA players are considered outliers. In order to detect them, clustering is performed, and those points which deviate significantly from the others are considered outstanding NBA players.

Regarding the characteristics of the data in the outlier detection scenario, it can be either temporal (time-series) (Gupta et al. 2013) or non-temporal (Aggarwal 2017; Campos et al. 2016; Radovanović et al. 2015).

An outlier detection task can be formalized as an unsupervised classification problem. Formally, given a dataset \(D=\{\mathbf {x}_1, \ldots , \mathbf {x}_n\}\), the objective is to find the instances that (highly) deviate from the others. An example of an outlier detection task can be seen in Fig. 6.
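As an illustration, the following sketch applies the Local Outlier Factor method (scikit-learn) to an unlabeled synthetic dataset; the data and the neighborhood size are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Unlabeled dataset D: a dense cloud plus two strongly deviated instances.
D = np.vstack([rng.normal(size=(200, 2)),
               [[8.0, 8.0], [-7.0, 9.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(D)               # -1 marks the detected outliers
print(np.where(labels == -1)[0])          # should include the two extremes
```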

Fig. 6 An example of unsupervised classification. This learning scenario is assigned to the outlier detection task. As can be seen, the outliers are instances that deviate from the rest without a clear pattern

6 The proposed assignment of terms and learning scenarios

In this paper, based on our experience and an initial approach to the literature (see the list of key references at the end of the paper), we discovered two major issues: (a) the existence of a problematic mix-up between terms and learning scenarios, and (b) the fact that most of these problems can be placed in the same learning framework. Furthermore, we based our taxonomy on the assignment of terms to problems in these key papers. For each paper, we have reviewed the goal, the characteristics of the input data and the most representative techniques used in the rare event, anomaly, novelty and outlier detection works. Concretely, for each term-related paper, we have reviewed the problem that the authors want to solve: whether it is a time series classification task, whether the learning is unbalanced, whether it is a classification or a regression task, which evaluation measures are used, and whether it is a supervised or unsupervised classification problem.

In Fig. 7, the assignment of terms to learning scenarios is graphically explained. As can be seen, each term is associated with one learning scenario. Moreover, the genuine characteristics of each learning scenario are shown in this figure. An extended summary is also presented in Table 2.

Table 2 Summary of the main characteristics of each term along with the key references of the literature
Fig. 7 Assignment of terms to learning scenarios. The main characteristics of each learning scenario have been summarized

In the case of the rare event term, the most relevant learning scenario from the supervised classification point of view is (early) time series classification. In most of the papers using the rare event term, the problem has a temporal nature, the classes are unbalanced, and all the classes are represented in the training set.

In the problems described with the anomaly term, the most relevant learning scenario is the (highly) unbalanced supervised classification. In this learning scenario, the data is static, the distribution of classes is unbalanced, and all the classes are represented in the training set.

Regarding the problems described with the novelty term, two different learning scenarios are considered. On the one hand, in static novelty detection, the objective is to classify an instance as novel or normal based on a model that has been trained only with the normal class. On the other hand, in dynamic novelty detection, the objective is to discover new emerging classes in a streaming environment. Both learning scenarios share some common characteristics: both are supervised, and both try to discover instances from classes that were not available in the training set. Hence, neither learning scenario has all the classes represented in the training data (in the case of static novelty detection, the novel class is not available; in the case of dynamic novelty detection, the new emerging classes are not available either).

Finally, the outlier detection term has been mostly associated with the unsupervised classification framework in the literature.

All these learning scenarios require specific measures in order to evaluate the performance of the classifiers that solve the related problems. Therefore, depending on the objective of the classification task, different measures are commonly computed in the literature. In Table 3, the most common evaluation measures for each term are presented. Regarding the evaluation techniques used to validate the performance of the classifiers, in the majority of the papers k-fold cross validation, stratified k-fold cross validation and the train/test split are used.
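As an illustration, stratified k-fold cross validation preserves the (unbalanced) class proportions in every fold; the following sketch, with a synthetic dataset and an illustrative classifier, combines it with the minority-class recall discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# An illustrative unbalanced dataset: roughly 5% minority instances.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Stratification keeps the class ratio in every fold, so each test fold
# contains minority instances with which to estimate minority-class recall.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="recall")
print(scores.mean())                     # average recall of the minority class
```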

Table 3 Summary of the most frequently used evaluation measures for each term-related learning scenario

7 Validation of the proposed assignment

In order to validate the proposed assignment, an experiment has been carried out considering two different scenarios. In the first scenario, the most cited papers published after the year 2000 have been gathered, while in the second scenario, the first search results published after 2014 have been considered. In both scenarios, two terms are obtained for each paper: on the one hand, the term used by the authors to describe the problem, and on the other hand, the term that would have been assigned with our taxonomy. In this way, a confusion-like matrix has been formed for every scenario.

In order to retrieve these papers, the Google Scholar, ACM Digital Library and IEEE Xplore search engines have been used individually. Hence, the experiment is replicated for each search engine. In this way, possible differences between these three communities have been checked.

The goal of the experiment is three-fold. Firstly, we would like to validate the presented proposal of assignment of terms to learning scenarios, and check whether it matches the majority of the papers in the literature. Secondly, we would like to identify the most frequently confused learning scenarios between pairs of terms. Thirdly, we have tested whether the confusion varies between different communities, and hence different search engines have been considered.

According to the confusion matrices of the most cited papers (Tables 4, 5, 6), the terms used to describe the different types of abnormalities mostly match our proposal of assignment. However, in some situations, we have found discrepancies. In particular, the highest discrepancies are found between the anomaly and rare event terms. When the authors use the anomaly term, it is frequently confused with our standardization of the rare event term. After checking the related literature, we realized that this happens when the problem has a temporal nature; these problems would have been described with the rare event term according to our proposal of assignment. Similar discrepancies, in the opposite direction, are found in problems described with the rare event term. When the novelty term is used by the authors to refer to the abnormalities of their problems, a small set of papers are confused with our concept of the anomaly term. In these works, instances of the novelty class are available during the training stage; consequently, according to the presented proposal of assignment, their learning scenario is associated with the anomaly term. Finally, considering the outlier term, only a few situations are found in which the outlier detection learning scenario has been confused with novelty detection. In these mismatched works, a normality model is learned from labeled data, and instances that do not conform to the normal behavior are rejected and considered outliers. Based on our proposal, this learning scenario corresponds to novelty detection.

Table 4 The confusion-like matrix formed from the results obtained from Google Scholar. For each term, the 50 most cited search results (papers) published after the year 2000 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios
Table 5 The confusion-like matrix formed from the results obtained from the ACM Digital Library. For each term, the 50 most cited search results (papers) published after the year 2000 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios
Table 6 The confusion-like matrix formed from the results obtained from the IEEE Xplore search engine. For each term, the 50 most cited search results (papers) published after the year 2000 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios

In the second scenario, with the first search results of each term after 2014 (Tables 7, 8, 9), a similar trend can be seen. However, there is some increase in the discrepancies. The confusion between the novelty and anomaly terms is noticeable: these two problem descriptors have been confused in more situations than in the previous experiment with the subset of most cited works.

Table 7 The confusion-like matrix formed from the results obtained from Google Scholar. For each term, the first 50 search results (papers) published after the year 2014 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios
Table 8 The confusion-like matrix formed from the results obtained from the ACM Digital Library. For each term, the first 50 search results (papers) published after the year 2014 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios
Table 9 The confusion-like matrix formed from the results obtained from the IEEE Xplore search engine. For each term, the first 50 search results (papers) published after the year 2014 have been analyzed. The terminology used by the authors is compared with our proposal of assignment of terms to learning scenarios

Regarding the different search engines, it can be seen that the mix-up is more prominent in the ACM community. In particular, in the first 50 search results (Table 8), the mix-up involving the outlier term is considerably higher than in the other communities. However, this trend cannot be seen in the 50 most cited papers (Table 5). Moreover, the novelty term also shows a slightly higher confusion in this community.

It can be concluded that the proposed assignment of terms to learning scenarios is supported by the literature. In addition, the confusion matrices reveal the mix-up between terms and learning scenarios, which clearly promotes the repetition of works and hinders the progress of the field. Furthermore, due to the popularity of and increase in contributions to these term-related fields in recent years, this confusion is growing. Therefore, we think that the standardization of the field is necessary, and with this review we take a short step towards the resolution of this mix-up.

8 Conclusions

In this paper, we have underlined the genuine characteristics of the rare event, anomaly, novelty and outlier terms that are shared by the majority of the papers in the literature, and we have assigned each term to a learning scenario. In order to do that, we have reviewed the aims of each paper, the characteristics of the input data and the most representative techniques used in rare event, anomaly and novelty detection works. Each term has been accompanied by a set of illustrative applications to highlight the different learning scenarios. We have argued that the learning scenarios associated with the reviewed terms can be formalized under a supervised classification framework, and we hope that the discussion of the closely related outlier term can enrich the comprehension of each scenario. Finally, the main characteristics of the terms and problems have been summarized in Table 10. In this table, both the features related to the available data and the characteristics of the problem are distinguished.

Table 10 Summary of the principal characteristics, extracted from the literature, of the reviewed terms and learning scenarios

With this paper, we take a short step towards the standardization of the rare event, anomaly, novelty and outlier terms. We think that our proposed assignment of terms to learning scenarios can help to resolve the muddle that hinders progress in the term-related fields. We also think that this standardization can strongly improve the progress of the field by letting the community (and especially young, newcomer researchers) easily find what they are looking for, and by avoiding the repetition of works.