1 Introduction

Anomaly detection techniques aim to find patterns that do not conform to expected behavior in the data set (Chandola et al. 2009; Huang 2013). These patterns are often called anomalies, outliers, abnormal changes, surprises or discords in different contexts, frequently arising in real-world applications such as bioinformatics and finance (Huang 2013; Chandola et al. 2008a, b; Keogh et al. 2005). In this paper we present a new anomaly detection method called weighted local outlier factor (WLOF), which is able to extract and weight features in time series.

In the past decades, many anomaly detection methods have been developed for specific application domains. They can be broadly divided into two categories (Beigi et al. 2011): modeling approaches (including rule-based, pattern-matching and model-based approaches), which require prior knowledge of the application domain, and data mining approaches (including similarity-based and statistical approaches), which do not. Hadi (1994) used a modeling approach based on statistical estimation of distribution parameters to identify anomalies in multivariate samples. Tandon and Chan (2007) used a parametric statistical modeling approach based on association rule mining for network intrusion detection. Keogh et al. (2005) used distance-based approaches to identify anomalies in time series. Sun et al. (2005) proposed an algorithm that computes the neighbourhood of each node in bipartite graphs using random walk with restarts and graph partitioning, and then uses the neighbourhood information to identify abnormal nodes. Some researchers have combined modeling and data-mining approaches to identify anomalies in data streams. For example, Chandola et al. (2008a, b) proposed a framework for modeling categorical data with a desired set of characteristics and a set of separability statistics, which are helpful for understanding the performance of similarity measures for outlier detection. In addition, Aydin et al. (2015) proposed a modified kernel-based tracking method for detecting anomalies in railway traffic, and Jin et al. (2016) proposed a method for detecting bearing anomalies and fault prognosis using the Kalman filter. Moreover, several surveys on outlier detection in different application areas have been reported in the literature (Hodge and Austin 2004; Zhang et al. 2008; Gupta et al. 2014).

The nature of the anomalies determines which anomaly detection techniques should be applied. Following Chandola et al. (2009), anomalies can be grouped into three categories. (1) Point anomalies: a data instance is anomalous with respect to the rest of the data, as in the case of credit card fraud. (2) Contextual anomalies: a data instance is anomalous in a specific context, but not otherwise. Contextual anomalies have been investigated in time series data (Weigend et al. 1995) and spatial data (Kou et al. 2006). (3) Collective anomalies: a collection of data instances is anomalous with respect to the entire data set. Collective anomalies can be found, for example, in electrocardiogram data (Keogh et al. 2005).

In this paper we focus on collective anomalies in different types of sequential data. In order to find collective anomalies, we need to segment a time series into a set of sub-series, i.e. subsequences. Piecewise linear representation (PLR) (Keogh et al. 2001; Yankov et al. 2007; Keogh et al. 2008) is a common feature representation method that has been used to capture the main features of time series data or data streams. The main idea of PLR is to use K connected straight lines to represent a time series of length n (K ≪ n). The advantages of PLR are: (1) a low-dimensional index structure and (2) high computational efficiency (Keogh et al. 2001; Yan et al. 2013). PLR can achieve higher precision with a larger number of segments, but at the cost of more computation time. Keogh et al. (2001, 2008) also proposed a piecewise aggregate approximation (PAA) method for dimensionality reduction in time series data (Palpanas et al. 2004), which segments a time series with a fixed-size window and uses the average value of each sub-segment to represent it. Park et al. (2001a, b) used a monotonic sliding-window segmentation algorithm to represent a time series and demonstrated good results on smooth time series data; however, real-world data often include a great deal of noise, and the number of segments required is then very large. Peng et al. (2000) used the landmark model to segment a time series by selecting segment points according to a minimum distance/percentage principle, a smoothing process implemented as a linear-time algorithm. Pratt and Fink (2002) proposed an important point segmentation method that compresses a time series by selecting some of its minima and maxima. In this paper we adopt a piecewise linear representation method based on important points (PLR_IP).
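To make the PAA idea above concrete, the following minimal Python sketch (our own illustration, not the paper's code; it assumes the series length is a multiple of the number of segments) represents a series by per-window averages:

```python
import numpy as np

def paa(series: np.ndarray, n_segments: int) -> np.ndarray:
    """Piecewise aggregate approximation: the mean of each of
    n_segments equal-width windows (length must divide evenly)."""
    return series.reshape(n_segments, -1).mean(axis=1)

# Example: a 500-point series reduced to 50 averaged segments.
x = np.sin(np.linspace(0, 20 * np.pi, 500))
print(paa(x, 50).shape)  # (50,)
```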

Given a new representation of time series data, we also need a method for measuring the difference between data objects (instances) embedded in subsequences in order to detect collective anomalies. A PLR method can thus be used to segment a time series into an alternative representation, and the distances of objects within their neighbourhood can be used to find anomalies. For instance, Ramaswamy et al. (2000) used the distance to the k-nearest neighbourhood to rank outliers; their approach can compute the top n outliers. Breunig et al. (2000) used a local outlier factor (LOF), whose value depends on how isolated an object is with respect to its surrounding neighbourhood, as a measure for determining outliers. Although that approach can find meaningful outliers, the LOF method has two issues. One is that it does not work well when features have different orders of magnitude: the features with large magnitudes determine the results, whereas the features with smaller magnitudes have little effect. The other is that the LOF method recognizes anomalies in time series data based on their original values (Breunig et al. 2000), but it cannot do so when anomalies are interleaved in regular frequency spectra or take other complex forms.

In order to address these two issues, we propose the WLOF, in which all selected features are taken into account in detecting anomalies. Importantly, we construct four features to represent time series data, three of which are defined on the basis of the PLR_IP, each representing a different aspect of a time series. The first feature is the average of the data points in the subsequence corresponding to a sliding window. The second and third features are the number of important points and the maximum angle of the subsequence, respectively, which are designed mainly for finding anomalies in regular spectra. The fourth feature is inspired by Lin et al. (2003), who used the symbolic aggregate approximation (SAX) method to map a time series into a character string such as “cbccbaab”, with every character of the alphabet representing the feature of one segment (Keogh et al. 2006). Similarly, to represent a segment with a feature, we propose a new feature based on the differences between the values of important points in a subsequence: we compute the maximum difference between important points in a sliding window, which may cover several segments. This feature represents the maximum change over all the segments involved in a sliding window. Together, these features constitute the core of the WLOF method for finding anomalies in time series data.

After presenting the WLOF method in detail, we present experimental results to evaluate it. The experiments were carried out over 17 benchmark datasets, with comparative analysis against other approaches, demonstrating the effectiveness of the proposed WLOF method in discovering anomalies within time series data.

The paper is organized as follows. In Sect. 2, we introduce the concepts of PLR_IP and WLOF. In Sect. 3 we present the experimental results over 17 data sets, which show that the proposed method can find local outliers. In Sect. 4 we discuss the effect of different parameters. Finally, Sect. 5 presents conclusions and future work.

2 Methodology

2.1 Notation

2.1.1 Time series and subsequences

Time series or sequential data exist in many real-world domains such as commercial, economic, medical, and gene expression data. These domains typically involve large amounts of regularly updated data, which makes it very difficult to detect anomalies directly in the original time series. Thus, we separate a time series into a set of relatively short subsequences using a sliding window. First, we give definitions of a time series and its subsequences as follows:

Definition 1

(Time series) A sequence of pairs \( T = [(Z_{1}, t_{1}), (Z_{2}, t_{2}), \ldots, (Z_{n}, t_{n})] \), \( t_{1} < t_{2} < \cdots < t_{n} \), where \( Z_{i} \) is a data point in a d-dimensional data space and \( t_{i} \) is the time stamp at which \( Z_{i} \) occurs (1 ≤ i ≤ n).

Definition 2

(Subsequence, Keogh et al. 2005) Given a time series \( T = [(Z_{1}, t_{1}), (Z_{2}, t_{2}), \ldots, (Z_{n}, t_{n})] \), a subsequence C of T is a sampling of length m ≤ n of contiguous positions starting at p, that is, \( C_{p,m} = [(Z_{p}, t_{p}), \ldots, (Z_{p+m-1}, t_{p+m-1})] \) for 1 ≤ p ≤ n − m + 1. To obtain a set of subsequences \( C_{m} = \{ C_{1}, C_{2}, \ldots, C_{n-m+1} \} \), sliding windows can be used, where each subsequence corresponds to a sliding window and the overlap between two adjacent sliding windows can be adjusted for different applications.
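As a concrete illustration of this construction, the following minimal Python sketch (our own, not the paper's code) extracts all subsequences of length m; the step parameter controls the overlap between adjacent windows:

```python
import numpy as np

def subsequences(z: np.ndarray, m: int, step: int = 1) -> np.ndarray:
    """All windows C_{p,m} of length m; step controls the overlap
    between adjacent windows (step=1 yields n - m + 1 subsequences)."""
    n = len(z)
    return np.stack([z[p:p + m] for p in range(0, n - m + 1, step)])

z = np.arange(10.0)
print(subsequences(z, m=4).shape)  # (7, 4): n - m + 1 = 7 windows
```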

2.1.2 Anomalous features of a subsequence

A subsequence could be anomalous compared with other subsequences, or could contain an anomaly; either case can be characterized by various features of the subsequence, such as its average value and the maximum difference between values of important points. In this study, four such features have been identified. Before defining them, we define extreme points, important points, piecewise linear representation and fitting error.

Definition 3

[Extreme points (Yan et al. 2013)] Given a 1-dimensional time series \( T = [(Z_{1}, t_{1}), (Z_{2}, t_{2}), \ldots, (Z_{n}, t_{n})] \), if (\( Z_{i} > Z_{i-1} \) and \( Z_{i} > Z_{i+1} \)) or (\( Z_{i} < Z_{i-1} \) and \( Z_{i} < Z_{i+1} \)), the point \( (Z_{i}, t_{i}) \) is an extreme point.

Definition 4

(Important points) Extreme points are important features of time series, but sometimes the distance between two neighbouring extreme points is too large, making it difficult to find an anomaly. For this reason, we introduce the concept of important points, consisting of extreme points plus additional points identified by the two-step procedure below. The first step selects extreme points that represent the largest distances to the points already selected, and the second step ensures that the distance between neighbouring important points is not too large.

  • Step 1 Select extreme points as important points. The first and last data points of the subsequence are always selected as important points. Suppose there are L extreme points in \( T = [(Z_{1}, t_{1}), (Z_{2}, t_{2}), \ldots, (Z_{n}, t_{n})] \), where L < n. For a specified number g of required important points and a parameter β ∊ (0, 1), if \( L \ge \left\lfloor \beta (g - 2) \right\rfloor \), then \( \left\lfloor \beta (g - 2) \right\rfloor \) extreme points are selected as important points iteratively as follows. At each iteration, the data point \( (Z_{r}, t_{r}) \) is selected, where r satisfies:

    $$ r = \mathop{\arg\max}\limits_{j \in FI} D[Z_{j}, Z_{i_{j}}] $$
    (1)

    where FI is the set of subscripts of extreme points that have not yet been selected as important points, D is a distance measure, and \( (Z_{i_{j}}, t_{i_{j}}) \) is the currently selected important point nearest to \( (Z_{j}, t_{j}) \). If \( L < \left\lfloor \beta (g - 2) \right\rfloor \), all the extreme points are selected as important points. Note that since we aim to find abnormal changes in the time series, and the change in time t is uniform (i.e. the distance between two adjacent data points in t is the same), we select the important points based on the Z value only.

  • Step 2 Select additional points as important points if necessary. The remaining \( g - 2 - \left\lfloor \beta (g - 2) \right\rfloor \) important points are also selected iteratively as follows. Suppose \( P = [(Z_{i_{1}}, t_{i_{1}}), (Z_{i_{2}}, t_{i_{2}}), \ldots, (Z_{i_{l}}, t_{i_{l}})] \) is the sequence of important points selected so far. At each iteration the data point \( (Z_{h}, t_{h}) \) is selected, where \( t_{h} = \left\lfloor \frac{t_{i_{a}} + t_{i_{a+1}}}{2} \right\rfloor \), \( Z_{h} = Z_{t_{h}} \), and a is obtained as follows:

    $$ a = \mathop{\arg\max}\limits_{1 \le j \le l - 1} D\left[ Z_{i_{j}}, Z_{i_{j+1}} \right] $$
    (2)

    i.e. we identify the largest distance between consecutive currently selected important points. If \( L < \left\lfloor \beta (g - 2) \right\rfloor \), all the extreme points are selected as important points and the remaining g − 2 − L important points are obtained using Formula (2).

Here we give an illustration of important points. Suppose that Fig. 1 shows a sequence of a time series, six important points are required, and \( \beta = \frac{1}{2} \). First of all, the beginning point b1 and end point b2 are selected, as indicated by the yellow circles; then we need to select two extreme points as important points according to Step 1 of Definition 4. First e1 is selected and then e2, according to Formula (1), as indicated by the red circles. We have now selected all the extreme points allowed, the number being \( \left\lfloor \beta (g - 2) \right\rfloor = 2 \) with g = 6 and \( \beta = \frac{1}{2} \), so the remaining extreme points m1, m2 and m3 cannot be selected as important points. In this situation, we need to select two additional points as important points according to Step 2 of Definition 4 to ensure none of the differences are too large. The largest difference in Z values between neighbouring points is between b1 and e1, so a1 is selected as an important point; a2 is then selected according to Formula (2), since after a1 has been added the largest difference in Z values is between a1 and e1. Points a1 and a2 are indicated by the green circles. Since large differences between points affect feature extraction, the six important points identified should be more suitable for this purpose.

Fig. 1 Illustration of important points

Definition 5

Piecewise linear representation (PLR) of time series based on important points (Yan et al. 2013)

Given a time series \( T = [(Z_{1}, t_{1}), (Z_{2}, t_{2}), \ldots, (Z_{n}, t_{n})] \) whose set of important points is \( T' = [(Z'_{1}, t'_{1}), (Z'_{2}, t'_{2}), \ldots, (Z'_{m}, t'_{m})] \), where \( Z'_{1} = Z_{1} \), \( Z'_{m} = Z_{n} \) and m < n, a PLR of T can be obtained by first defining a set of functions \( T_{l} = (f_{1}, f_{2}, \ldots, f_{m-1}) \), where \( f_{j} \) is the linear fitting function between the points \( (Z'_{j}, t'_{j}) \) and \( (Z'_{j+1}, t'_{j+1}) \). The PLR of T is obtained by replacing each point of T with the point given by the corresponding \( f_{j} \) at the same time stamp. The fitting sequence can be expressed as \( T'' = [(Z''_{1}, t_{1}), (Z''_{2}, t_{2}), \ldots, (Z''_{n}, t_{n})] \). In this paper T′ denotes the set of important points and T″ the fitting sequence.

Definition 6

(Fitting error of PLR) Having defined the fitting sequence T″, which has the same length as the original sequence T, the fitting error between the two is defined as follows:

$$ Err = \sqrt {\sum\nolimits_{i = 1}^{n} {\left( {Z_{i} - Z_{i}^{''} } \right)^{2} } } $$
(3)

where n is the length of the original sequence, and \( Z_{i} \) and \( Z''_{i} \) denote the original and fitted values at the same time \( t_{i} \), respectively. A smaller fitting error indicates that the fitting sequence better reflects the original sequence.
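As a minimal sketch of Definitions 5 and 6 (our own illustration; it realizes the linear fitting functions via numpy's piecewise linear interpolation through the important points, and the function names are ours):

```python
import numpy as np

def plr_fit(t, z, t_imp, z_imp):
    """Piecewise linear fitting sequence through the important points
    (t_imp, z_imp), evaluated at every original time stamp t (Definition 5)."""
    return np.interp(t, t_imp, z_imp)

def fitting_error(z, z_fit):
    """Root of the summed squared residuals, Formula (3)."""
    return np.sqrt(np.sum((z - z_fit) ** 2))

t = np.arange(8.0)
z = np.array([0., 1., 0., 2., 1., 3., 0., 1.])
t_imp = np.array([0., 3., 5., 7.])   # important-point time stamps
z_imp = z[[0, 3, 5, 7]]              # important-point values
print(fitting_error(z, plr_fit(t, z, t_imp, z_imp)))
```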

According to Definition 5, we develop a segmentation method called PLR_IP, which uses the important points to segment the time series. We now define the four features that will be used to characterize subsequences (each of which corresponds to a sliding window) for anomaly detection.

Definition 7

The maximum angle of a subsequence

Let \( T' = [(Z'_{1}, t'_{1}), (Z'_{2}, t'_{2}), \ldots, (Z'_{len}, t'_{len})] \) be the important points in a given subsequence, where len is the number of important points; for simplicity we write \( T' = [I_{1}, I_{2}, \ldots, I_{len}] \). Define \( \theta_{i} \) to be the angle between the vectors \( V_{i-1,i} \) and \( V_{i,i+1} \), where \( V_{i-1,i} \) is the vector from \( I_{i-1} \) to \( I_{i} \) and \( V_{i,i+1} \) the vector from \( I_{i} \) to \( I_{i+1} \), for i = 2, 3, …, len − 1. \( \theta_{i} \) is called the degree of anomaly of the ith important point, as shown in Fig. 2. The maximum angle of the subsequence corresponding to a sliding window is denoted \( S_{\theta}^{p} \) and is given by

$$ S_{\theta}^{p} = \max \left\{ \left| \theta_{2} \right|, \left| \theta_{3} \right|, \ldots, \left| \theta_{len-1} \right| \right\}, \quad \theta_{i} \in (-\pi, \pi) $$
(4)
Fig. 2 The angle or degree of anomaly of the important point, where I1, I2 and I3 are important points according to Definition 4

Note that no degree of anomaly is defined for the first and last important points of a subsequence. The angles are determined by the important points alone; the fitted data points do not affect them.

Definition 8

Number of important points in a subsequence

The number of important points in a subsequence, denoted \( S_{N}^{p} \), is defined as

$$ S_{N}^{p} = \left| \left\{ (Z'_{\alpha}, t'_{\alpha}) \in T' : t_{p} \le t'_{\alpha} \le t_{p+m-1} \right\} \right| $$
(5)

where \( T' = [(Z'_{1}, t'_{1}), (Z'_{2}, t'_{2}), \ldots, (Z'_{len}, t'_{len})] \), with \( t'_{1} < t'_{2} < \cdots < t'_{len} \), is the set of important points of the time series T. \( S_{N}^{p} \) is the number of important points in \( C_{p} \) computed by Definition 4.

Definition 9

Average value of a subsequence

The average value of Z over a subsequence, denoted \( S_{\mu}^{p} \), is defined as

$$ S_{\mu }^{p} = \frac{1}{m}\sum\limits_{i = p}^{p + m - 1} {Z_{i} } $$
(6)

where p is the index of the first position of the sliding window and p + m − 1 the index of the last, and the \( Z_{i} \) are the values of the data points in the sliding window \( C_{p} \).

Definition 10

The maximum difference between values of important points in a subsequence

$$ S_{\sigma}^{p} = \max \left\{ h_{2}, h_{3}, \ldots, h_{len} \right\} $$
(7)

where \( h_{i} = \left| Z'_{i} - Z'_{i-1} \right| \) is the difference in Z between \( (Z'_{i}, t'_{i}) \) and \( (Z'_{i-1}, t'_{i-1}) \), and \( T' = [(Z'_{1}, t'_{1}), (Z'_{2}, t'_{2}), \ldots, (Z'_{len}, t'_{len})] \) are the important points in the sliding window.
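Putting Definitions 7–10 together, a minimal Python sketch of the feature computation for one subsequence could look as follows (our own illustration; the signed-angle formula via arctan2 is one way to obtain \( \theta_{i} \in (-\pi, \pi) \), and the function name is ours):

```python
import numpy as np

def window_features(z_win, t_imp, z_imp):
    """The four features of one subsequence (Definitions 7-10);
    (t_imp, z_imp) are the important points inside the window."""
    s_mu = z_win.mean()                                    # Definition 9
    s_n = len(t_imp)                                       # Definition 8
    s_sigma = np.abs(np.diff(z_imp)).max() if s_n >= 2 else 0.0  # Definition 10
    s_theta = 0.0                                          # Definition 7
    for i in range(1, s_n - 1):
        v1 = (t_imp[i] - t_imp[i - 1], z_imp[i] - z_imp[i - 1])
        v2 = (t_imp[i + 1] - t_imp[i], z_imp[i + 1] - z_imp[i])
        # signed angle between v1 and v2 (cross product, dot product)
        ang = np.arctan2(v1[0] * v2[1] - v1[1] * v2[0],
                         v1[0] * v2[0] + v1[1] * v2[1])
        s_theta = max(s_theta, abs(ang))
    return s_theta, s_n, s_mu, s_sigma
```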

2.1.3 A weighted local outlier factor method

Using the features of the time series defined above, we propose a new anomaly detection method called the “weighted local outlier factor”, which assigns a different weight to each feature and then uses the weighted features for anomaly detection. The relevant definitions are given below.

Definition 11

The distance between two subsequences P and Q in the new feature space.

We have defined four features in Definitions 7–10, which give us a four-dimensional feature space in which we can compute the distance between two subsequences. Suppose subsequence P is represented by the point \( (x_{p}, y_{p}, l_{p}, m_{p}) \) and subsequence Q by the point \( (x_{q}, y_{q}, l_{q}, m_{q}) \) in this feature space, where x, y, l, m denote the four features and the number of subsequences n is determined by the size of the sliding window. The weighted Euclidean distance is defined as follows:

$$ {\text{wdist}}\,\left( {P,Q} \right) \equiv \sqrt {w_{1} \left( {x_{p} - x_{q} } \right)^{2} + w_{2} \left( {y_{p} - y_{q} } \right)^{2} + w_{3} \left( {l_{p} - l_{q} } \right)^{2} + w_{4} \left( {m_{p} - m_{q} } \right)^{2} } $$
(8)

where the \( w_{i} \) are weights assigned to the four features, with \( \sum_{i=1}^{4} w_{i} = 1 \). To determine appropriate weights, we use the sum of the values of each feature and ensure that, for a given feature, the larger its sum, the smaller its weight. This prevents a feature with a large sum from determining the result while the other features become irrelevant. One way of achieving this is as follows:

$$ w_{i} = \frac{{\sum\limits_{j = 1}^{4} {Sum_{j} - Sum_{i} } }}{{3\left( {\sum\limits_{j = 1}^{4} {Sum_{j} } } \right)}} $$
(9)

where \( Sum_{1} = \sum_{k=1}^{n} |x_{k}| \) for feature x, and similarly for the other features y, l and m. The idea is that instead of using the normalized sum, i.e. \( w_{i} = \frac{Sum_{i}}{\sum\nolimits_{j=1}^{4} Sum_{j}} \), we use the mean of the normalized sums of the other three features, so that larger sums get smaller weights. An empirical comparison between the weighted local outlier factor and the local outlier factor is presented in Sect. 3.
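A minimal sketch of Formula (9) over an n × 4 feature matrix (our own; the function name is illustrative):

```python
import numpy as np

def feature_weights(features: np.ndarray) -> np.ndarray:
    """Formula (9): a feature with a larger absolute sum receives a
    smaller weight; the four weights sum to 1."""
    sums = np.abs(features).sum(axis=0)        # Sum_i for each feature
    return (sums.sum() - sums) / (3.0 * sums.sum())

F = np.array([[1., 100., 5., 3.],
              [2., 120., 4., 2.]])
w = feature_weights(F)
print(w, w.sum())  # the dominant 2nd feature gets the smallest weight; sum = 1
```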

Definition 12

The k-distance of subsequence object P: kwdist(P) (Breunig et al. 2000)

Here each subsequence is viewed as one object represented by the four features x, y, l, m. For any positive integer k, the k-distance of object P, denoted kwdist(P), is defined as the distance wdist(P, O) (see Definition 11) between P and an object O ∊ D, where D is the set of subsequence objects, such that:

  1. For at least k objects \( O' \in D\backslash \{ P\} \) it holds that wdist(P, O′) ≤ wdist(P, O), and

  2. For at most k − 1 objects \( O' \in D\backslash \{ P\} \) it holds that wdist(P, O′) < wdist(P, O).

These constraints define the k-distance of object P as the distance between P and its kth nearest object O. Figure 3 illustrates the k-distance of a subsequence object P. The reachability distance of an object is defined as follows:

Fig. 3 The k-distance of subsequence object P: kwdist(P) for k = 4

Definition 13

The k-weighted local reachability density of subsequence object P (Breunig et al. 2000):

$$ wlrd_{k}(P) = \frac{k}{\sum\nolimits_{Q \in kw(P)} reach\text{-}wdist_{k}(P, Q)} $$
(10)

where \( kw(P) = \{ Q \in D\backslash \{ P\} : wdist(P, Q) \le kwdist(P) \} \) and \( reach\text{-}wdist_{k}(P, Q) = \max \{ kwdist(Q), wdist(P, Q) \} \). Based on the reachability distance, the weighted local outlier factor of an object P is defined as follows:

Definition 14

k-weighted local outlier factor of an object P (Breunig et al. 2000)

$$ WLOF_{k}(P) = \frac{\frac{1}{k}\sum\nolimits_{Q \in kw(P)} wlrd_{k}(Q)}{wlrd_{k}(P)} $$
(11)

According to Definition 14, we can compute the k-weighted local outlier factor of each subsequence object P; the larger its value, the stronger the anomaly. From here on this will simply be referred to as the weighted outlier factor, noting that it depends on the constant k.
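Definitions 11–14 can be combined into a compact sketch (ours, not the paper's implementation; it uses a brute-force O(n²) distance matrix and, for simplicity, assumes exactly k neighbours per object, i.e. it ignores distance ties):

```python
import numpy as np

def wlof(features: np.ndarray, weights: np.ndarray, k: int) -> np.ndarray:
    """k-weighted local outlier factor per subsequence (Definitions 11-14)."""
    # wdist (Formula 8) equals the Euclidean distance after scaling each
    # feature by the square root of its weight
    x = features * np.sqrt(weights)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    idx = np.argsort(d, axis=1)[:, :k]                          # k nearest neighbours
    kdist = np.take_along_axis(d, idx[:, -1:], axis=1).ravel()  # k-distance
    # reach-wdist_k(P, Q) = max(kwdist(Q), wdist(P, Q))
    reach = np.maximum(kdist[idx], np.take_along_axis(d, idx, axis=1))
    lrd = k / reach.sum(axis=1)                                 # Formula (10)
    return lrd[idx].mean(axis=1) / lrd                          # Formula (11)

# Example: the last point is isolated in feature space -> largest WLOF.
F = np.array([[0., 0, 0, 0], [.1, 0, 0, 0], [0, .1, 0, 0],
              [.1, .1, 0, 0], [5., 5, 0, 0]])
print(wlof(F, np.full(4, 0.25), k=2))
```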

2.2 Anomaly detection algorithm based on weighted local outlier factor

2.2.1 Selection of important points

Based on Definition 4, we present pseudo-code for selecting important points in Algorithm 1; the time series is then segmented into g − 1 segments using the g important points.

Algorithm 1 Selection of important points
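A runnable Python sketch of Algorithm 1 (our own reading of Definition 4; distances are taken on the Z values, as the definition specifies, and the function names are illustrative):

```python
import numpy as np

def extreme_indices(z):
    """Indices of local minima and maxima (Definition 3)."""
    return [i for i in range(1, len(z) - 1)
            if (z[i] > z[i-1] and z[i] > z[i+1])
            or (z[i] < z[i-1] and z[i] < z[i+1])]

def important_points(z, g, beta=0.5):
    """Select g important points; beta controls the share of extremes."""
    n = len(z)
    selected = [0, n - 1]                        # first and last points
    ext = extreme_indices(z)
    budget = min(len(ext), int(np.floor(beta * (g - 2))))
    # Step 1: greedily add the extreme point whose Z value is farthest
    # from its nearest already-selected important point, Formula (1)
    for _ in range(budget):
        cand = [j for j in ext if j not in selected]
        r = max(cand, key=lambda j: min(abs(z[j] - z[i]) for i in selected))
        selected.append(r)
    # Step 2: split the consecutive pair with the largest Z difference
    # at its time midpoint, Formula (2)
    while len(selected) < g:
        s = sorted(selected)
        a = max(range(len(s) - 1), key=lambda i: abs(z[s[i+1]] - z[s[i]]))
        h = (s[a] + s[a + 1]) // 2
        if h in selected:
            break                                # no new midpoint available
        selected.append(h)
    return sorted(selected)
```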

2.2.2 A new method based on weighted local outlier factor

The proposed anomaly detection algorithm is based on the weighted local outlier factor as shown in Algorithm 2. It involves the following main steps:

  • Step 1 Uniform scaling. This operation enlarges or shrinks the data by scaling all points into the range [0, 1].

  • Step 2 Smooth the data using locally weighted scatter-plot smoothing (LOWESS). To find the extreme points, we must smooth the original data; otherwise too many extreme points are found (a sketch of Steps 1–2 is given after this list).

  • Step 3 Selection of important points. We select the important points according to Formulas (1) and (2) in Definition 4, as shown in Algorithm 1.

  • Step 4 Compute the features of subsequences. (1) The maximum angle of the subsequences, (2) the number of important points in the subsequences, (3) the average of the subsequences, and (4) the maximum difference between values of important points of the subsequences.

  • Step 5 Compute the weighted local outlier factors. We compute the weighted local outlier factor of each subsequence based on Definition 14 and rank them; the larger the value of the k-weighted outlier factor, the stronger the anomaly.

At the end of the process, the weighted local outlier factor of each subsequence is output; larger values represent stronger anomalies. We show the largest values of the weighted outlier factor of subsequences over different data sets in Sect. 3.
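For Steps 1–2, a minimal sketch (ours; it assumes the statsmodels LOWESS implementation, and the smoothing fraction frac is an illustrative parameter not specified in the paper):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def scale_and_smooth(z: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Steps 1-2 of Algorithm 2: scale into [0, 1], then LOWESS-smooth;
    frac is the share of points used in each local regression."""
    z = (z - z.min()) / (z.max() - z.min())   # Step 1: uniform scaling
    t = np.arange(len(z))
    return lowess(z, t, frac=frac, return_sorted=False)  # Step 2: smoothing
```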

2.2.3 Metrics for measurement

Huang (2013) introduced two metrics, which are used in this study to measure the performance of the anomaly detection algorithms. Suppose a dataset D of n objects contains \( d_{k} \) true anomalies. We use our proposed method to find anomalies ranked within the top 10, and let \( m_{k} \) be the number of true anomalies thus detected in D. The accuracy of anomaly detection is then defined as follows:

$$ {\text{Accuracy}} = \frac{{m_{k} }}{{d_{k} }} $$
(12)

The second measure, “RankPower”, was also introduced in Huang (2013). Suppose \( R_{i} \) denotes the rank of the ith true anomaly detected. Then,

$$ {\text{RankPower}} = \frac{{m_{k} (m_{k} + 1)}}{{2\sum\limits_{i = 1}^{{m_{k} }} {R_{i} } }} $$
(13)

Larger values of the two metrics imply better performance.
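Both metrics reduce to a few lines of Python (our sketch; the argument names are illustrative):

```python
def accuracy_and_rankpower(true_ranks, d_k):
    """Formulas (12)-(13): true_ranks are the (1-based) ranks, within the
    top 10, at which true anomalies were detected; d_k is the number of
    true anomalies in the data set."""
    m_k = len(true_ranks)
    accuracy = m_k / d_k
    rank_power = m_k * (m_k + 1) / (2 * sum(true_ranks)) if m_k else 0.0
    return accuracy, rank_power

# Example: 2 of 2 true anomalies found at ranks 1 and 2 -> perfect scores.
print(accuracy_and_rankpower([1, 2], d_k=2))  # (1.0, 1.0)
```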

3 Experimental results

Since we use the sliding window method to obtain the subsequences, several parameters need to be set before conducting an evaluation. For the proposed k-weighted local outlier factor method, we obtain the maximum anomaly values by searching from k = 5 to k = 20 in steps of 1. We use the important points to segment the time series for the piecewise linear representation; in Sect. 3.1 we vary the number of important points to evaluate the effect of the piecewise linear representation, and in Sect. 3.2 we set it to 10% of the length of the time series. The sliding window method requires a window size: here we set the window sizes to be larger than the period of the system underlying the time series data in order to find anomalies. We also ran comparison experiments with window sizes 50% smaller and 50% larger than the selected ones (Sect. 4). For selecting the extreme points and additional points, we set the parameter β to 1/2.

The experiments start by obtaining the subsequences and selecting important points with the parameter β, sliding a window of length w across the time series T, then obtaining the features for each subsequence, and finally computing the weighted local outlier factor of each subsequence. Note that the index of subsequences goes from 1 to n − w + 1. The experiments use the piecewise linear representation based on important points on the 17 data sets shown in Table 1, which were downloaded from www.cs.ucr.edu/~eamonn/.

Algorithm 2 Anomaly detection based on the weighted local outlier factor
Table 1 Comparison of fitting errors

3.1 Experimental results of piecewise linear representation based on important points (PLR_IP)

This section reports the evaluation of using important points (PLR_IP) to obtain the subsequences. Table 1 presents summary statistics for the 17 data sets used in this work for the comparison between PLR_IP and piecewise linear representation based on the piecewise aggregate approximation (PLR_PAA). In the evaluation, the number of segments over these data sets is determined by the number of important points, from 40 to 100, i.e. 8–20% of the data points for a data set containing 500 points. In the rest of the experiments, we set the number of important points to 10% of the data points. If the length of a data set is larger than 500, we separate the data set into several segments of 500 data points each. If there are fewer than 500 data points in the last segment, it is combined with the preceding one, as illustrated in Column 3 of Table 1. For example, for 500*6 + 750 in the first row of Table 1, the last segment of 250 points is combined with the previous segment of 500 points. We then compute the average fitting error. These data sets are used for anomaly detection in the following sections. We compute the average fitting errors of PLR_IP and PLR_PAA for the different segment numbers (40–100), i.e. the number of segments of PLR_IP and the number of intervals of PLR_PAA. Figure 4 shows the experimental results for the ECG stdb_308_0 dataset, while the results for all the datasets, averaged over the number of segments, are shown in the last two columns of Table 1. We used the t test to examine the differences between the fitting errors of PLR_IP and PLR_PAA over all the data sets. The one-sided paired t test value is 0.22, which indicates that the difference between the PLR_IP and PLR_PAA errors over these data sets is not statistically significant. However, as Table 1 shows, the PLR_IP method does achieve a lower fitting error on 9 data sets.

Fig. 4 Comparison results for ECG stdb_308_0 (1:500)

Examining all the data sets, we find that PLR_IP has larger fitting errors for data sets with many peaks, such as the Space Shuttle Marotta Valve series and the Respiration data set. On the other hand, PLR_IP has smaller fitting errors for data sets with fewer peaks, such as Aerospace L-1t and stdb_308_0. Overall, the PLR_IP method can effectively fit these sequential datasets.

3.2 Anomaly detection in electrocardiograms

Electrocardiograms (ECGs) are time series recording the activity of the heart, detected by electrodes attached to the surface of the skin and recorded or displayed by a device external to the body. Given their importance, many annotated data sets have been collected. This experiment evaluates three ECG datasets, chfdb_chf01_275, chfdb_chf13_45590 and stdb_308_0, shown in Figs. 5, 6 and 7, respectively. Figures 5 and 6 are very simple, and it is easy to find the anomaly, whereas Fig. 7 shows very complicated ECG data in which the anomaly is difficult to find. Figures 5, 6 and 7 show the original time series (blue line) and the result using PLR_IP (red line). Table 2 shows the experimental results on the ECG chfdb_chf01_275 using the WLOF and the LOF (vector) method, which uses the vector of all values of the original subsequence as the input to the LOF method (Breunig et al. 2000); the window size is set to w = 400 and the number of important points to m = 375. In this study, we present at most the top 10 detected subsequences, ranked by their WLOF values. As seen from Table 2, the strongest outlier is in subsequence 1991. Because the window size is 400, the strongest outlier data point sequence is 1991–2390, and the second strongest is 2163–2560. The rank 1 and rank 2 subsequences overlap with the anomaly area marked by the yellow circle in Fig. 5. The anomaly is also detected by the LOF (vector) method at rank 1.

Fig. 5 The time series anomaly found in electrocardiogram chfdb_chf01_275 (marked in yellow circle)

Fig. 6 The time series anomaly found in chfdb_chf13_45590 (marked in yellow circle)

Fig. 7 The time series anomaly found in electrocardiogram stdb_308_0 (marked in yellow circle)

Table 2 Results of the ECG chfdb_chf01_275 for window size = 400

Table 3 shows the results on the ECG chfdb_chf13_45590 using the WLOF and the LOF (vector) method, with the window size set to w = 250 and the number of important points to m = 375. The strongest outlier is subsequence 2728. Because the window size is 250, the strongest outlier data point sequence is 2728–2977, in which a possible anomaly area is marked with the yellow circle in Fig. 6. The anomaly is not detected by the LOF (vector) method until rank 4. Table 4 shows the results on the ECG stdb_308_0 using the proposed WLOF and the LOF (vector) method, with window size w = 400 and number of important points m = 550. The strongest outlier is in subsequence 1939. Because the window size is 400, the strongest outlier data point sequence is 1939–2388; rank 3 also includes the anomaly area indicated with the yellow circle in Fig. 7. The anomaly is detected by the LOF (vector) method at rank 6.

Table 3 Results of the ECG chfdb_chf13_45590 for window size = 250
Table 4 Results of the ECG stdb_308_0 for window size = 400

3.3 Anomaly detection in space telemetry

Figures 8 and 9 show two Space Shuttle Marotta Valve series that were annotated by a NASA engineer (Keogh et al. 2005). In Fig. 8, the expert annotated the anomaly as “Poppet pulled out of the solenoid before energizing”; in Fig. 9, as “Poppet pulled significantly out of the solenoid before energizing”. Tables 5 and 6 show the results for Space Shuttle Marotta Valve Series 1 and 2 using the WLOF and LOF (vector) methods, with the window size set to w = 500 and the number of important points to m = 500. The strongest outlier subsequence for series 1 according to WLOF starts at 2098 and, because the window size is 500, spans 2098–2597, overlapping the anomaly area marked by the yellow circle in Fig. 8. The strongest outlier subsequence for series 2 according to WLOF is 369–868, which does not overlap with the anomaly area marked by the yellow circle in Fig. 9; however, the 8th strongest outlier subsequence for series 2 is 4030–4529, which does. Note that none of the subsequences identified by the LOF method in Tables 5 and 6 overlap with the corresponding anomaly areas in Figs. 8 and 9.

Fig. 8 The time series anomaly found in space shuttle Marotta valve series 1 (marked in yellow circle)

Fig. 9 The time series anomaly found in space shuttle Marotta valve series 2 (marked in yellow circle)

Table 5 Results of the space shuttle Marotta valve series 1 for window size = 500
Table 6 Results of the space shuttle Marotta valve series 2 for window size = 500

3.4 Anomaly detection in patients’ respiration

The Respiration dataset is a time series showing a patient's respiration (measured by thorax extension). The dataset was manually segmented and labeled with 'awake' and 'sleep' (Keogh et al. 2005). Figure 10 shows the original time series of the patient's respiration (blue line) and the segmented result (red line). As Fig. 10 shows, there are three different stages (0–2950, 2951–3300, and 3301–4000). Table 7 shows the detection results on the Respiration dataset using the WLOF, with window size w = 150 and number of important points m = 400. The strongest outliers are subsequences 2908 and 2909; given the window size of 150, the strongest outlier data subsequence is thus 2908–3057, which includes the change from the first stage to the second, and the rank 7 subsequence is 3390–3539, which is just above the change from the second stage to the third. The LOF (vector) method finds relevant subsequences at all ranks from 1 to 7, but they all correspond to the same transition from the second stage to the third, with the subsequences all lying just below the boundary between these stages.

Fig. 10 The original time series of patients’ respiration (blue line) and the segmented result (red line)

Table 7 Results of the patients’ respiration for window size = 150

3.5 Anomaly detection in aerospace data

This section presents the experimental results of anomaly detection on the Aerospace time series data sets (Keogh et al. 2004), shown in Figs. 11, 12, 13, 14 and 15. Figure 11 shows the data set L-1j, an impulse series with one impulse negated (inverted). Table 8 shows the results for Aerospace L-1j, with window size w = 30 and number of important points m = 100. The strongest outlier is subsequence 480, so the segment 480–509 overlaps with the anomaly of Aerospace L-1j with one negative impulse, as shown in Fig. 11. The anomaly is also detected at rank 1 by the LOF (vector) method. The same parameters were used for the Aerospace L-1b sequence, in which one impulse has its amplitude doubled, as shown in Fig. 12 and Table 9. The strongest outlier is subsequence 471, so the segment 471–500 overlaps with the anomaly with one impulse amplitude doubled, as shown in Fig. 12. This anomaly is also detected at rank 1 by the LOF (vector) method.

Fig. 11 The time series anomaly found in Aerospace L-1j data (marked in yellow circle)

Fig. 12 The time series anomaly found in Aerospace L-1b data (marked in yellow circle)

Fig. 13 The time series anomaly found in Aerospace L-1p data (marked in yellow circle)

Fig. 14 The time series anomaly found in Aerospace L-1q data (marked in yellow circle)

Fig. 15 The time series anomaly found in Aerospace L-1t data

Table 8 Results of Aerospace L-1j data set for window size = 30
Table 9 Results of Aerospace L-1b data set for window size = 30

Figure 13 shows the Aerospace L-1p sequence, a sine with phase advance. Table 10 shows the results for the Aerospace L-1p sequence using the WLOF and LOF (vector) methods, with window size w = 30 and number of important points m = 100. The strongest outlier is subsequence 481, and the segment 481–510 overlaps the anomaly of Aerospace L-1p, as shown in Fig. 13. The LOF (vector) method cannot detect the anomaly within ranks 1–10. Figure 14 shows the Aerospace L-1q sequence, a sine with phase delay. Table 11 shows the results for the Aerospace L-1q sequence using the WLOF and LOF (vector) methods, with window size w = 30 and segment number m = 100. The strongest outlier subsequence according to WLOF is 503–532, which does not overlap with the anomaly area marked by the yellow circle in Fig. 14; however, the 2nd strongest outlier subsequence is 439–468, which does. This anomaly is at rank 1 for the LOF (vector) method.

Table 10 Results of Aerospace L-1p data set for window size = 30
Table 11 Results of Aerospace L-1q data set for window size = 30

Figure 15 shows the Aerospace L-1t sequence, a sine with shot noise. The data set has three anomalies, each consisting of one cycle with a few large-magnitude values. Table 12 shows the results for the Aerospace L-1t sequence with window size w = 30 and number of important points m = 100. The strongest outlier is subsequence 471; the segment 471–500 contains one of the anomalies in Aerospace L-1t, as shown in Fig. 15. Ranks 2, 3 and 4 correspond to the second anomaly and ranks 5 and 6 to the third. The LOF (vector) method obtains similar results for this data set.

Table 12 Results of Aerospace L-1t data set for window size = 30

The experimental results for the other data sets given in Table 1 are shown in Table 13. There are two anomalies in the Lighting2_TEST data set, detected at rank 1 and rank 2. There is only one anomaly in each of the other data sets; the anomaly is detected at rank 1 in four of them and at rank 3 in the remaining one. The results are compared with the LOF (vector) method in Table 14.

Table 13 The experimental results for 5 data sets
Table 14 Experimental results of different window sizes and methods

4 Discussion

Many rank-based anomaly detection algorithms have been proposed, such as LOF, the connectivity-based outlier factor (COF), and the influential measure of outlierness by symmetric relationship (INFLO) (Huang 2013). They have been used to detect anomalies in several public benchmark data sets; some anomalies are detected at rank 1, but others are missed (Huang 2013). Our empirical results demonstrate that the WLOF method outperforms the LOF method over the seventeen datasets under different settings of the window size and number of important points. Here we examine the effect of the different parameters. The number of important points was set to 10% of the number of data points. We set the window size according to the features of the time series data; it should be larger than the distance from one peak to the next. In order to examine the effect of our feature extraction method, we also obtained results with the LOF method using the features from our method, so that it could be compared with the LOF method using the vector of all original data points, as used in Sect. 3. The experimental results are shown in Table 14.

We also examine different window sizes for the WLOF, NLOF, LOF and LOF (vector) methods. The difference between WLOF and NLOF is that instead of constructing four features with different weights, NLOF normalizes the time series data by simply mapping each data point into the range [0, 1]. The difference between LOF and LOF (vector) is that the input to the LOF method is the four subsequence features obtained by our feature extraction method, whereas the input to LOF (vector) is the vector of all original values of the subsequence. As Table 14 shows, all the anomalies are detected by the WLOF method using the window sizes of Sect. 3, and only one anomaly is missed within ranks 1–10 by the LOF method using our feature extraction; by contrast, 7 anomalies are missed by the LOF (vector) method using the original data point values as features. This result illustrates that our feature extraction and weighting methods achieve better performance than the LOF methods. For these window sizes, the WLOF finds 100% of the anomalies, the LOF method 95%, and the LOF (vector) only 65%. The WLOF also obtains better rankings for most of these data sets, such as data sets 1, 12 and 15, achieving the best RankPower of 5.12 compared to 3.39 for LOF and 2.76 for LOF (vector). Regarding other window sizes, as Table 14 shows, when the window sizes are reduced by half relative to Sect. 3, 11 anomalies are missed by the LOF (vector) method (a detection rate of just 45%), while 9 anomalies are missed by the LOF and 9 by the WLOF (a 55% detection rate), with two anomalies ranked at 10 by the LOF method. RankPower also reflects the performance of the algorithms: the WLOF obtains a better RankPower (1.83) than the LOF and LOF (vector) methods, whose RankPower values are 1.32 and 0.96, respectively. For window sizes one and a half times those of Sect. 3, 7 anomalies are missed by both the LOF (vector) and LOF methods (they find 65% of the anomalies), although 2 anomalies are detected at rank 10 by LOF (vector), and 5 anomalies are missed by WLOF (which finds 75%). The WLOF again obtains a better RankPower (2.35) than the LOF and LOF (vector) methods, whose values are 2.07 and 2.22, respectively.

We also carried out experiments with NLOF over these datasets. Unlike NLOF, which normalizes the time series, our weighted method WLOF takes account of the relationships between features by weighting them when aggregating all the features together. Table 14 shows the experimental results obtained using NLOF, which achieves accuracies of 95%, 55%, and 85% and RankPower values of 2.32, 1.47, and 2.15 for the different window sizes, respectively. As Table 14 also shows, the WLOF method obtains accuracies of 100%, 55%, and 75% and RankPower values of 5.12, 1.83, and 2.35 for the different window sizes, respectively. In other words, WLOF obtains better RankPower than NLOF. As Table 15 shows, the accuracy of finding the anomalies is 100% for β = 1/2, 80% for β = 2/3 and 75% for β = 3/4; these accuracies are better than the results for LOF (vector). Overall, the experimental results demonstrate that, with suitable window sizes, our method improves the performance of anomaly detection over the 17 data sets in comparison with the LOF methods.

Table 15 Experimental results for WLOF with different values of the parameter β

We now compare our WLOF with the HOT SAX method proposed by Keogh et al. (2005), who used it to represent time series data and then find discords based on the distance between subsequences. This method also requires several parameters: a window size for the subsequences, and the parameter nseg, the number of symbols used to represent a subsequence. The alphabet size, set to 10 in this paper, means that HOT SAX uses the alphabet “a, b, c, …, j” to represent subsequences; more details can be found in Keogh et al. (2005). Table 16 shows the experimental results. The accuracy for all window sizes is 75%, with a RankPower of 2.61 for the window sizes of Sect. 3 and of 2.93 and 4.62 for window sizes 50% smaller and larger, respectively. Therefore, the WLOF obtained greater accuracy for the window sizes of Sect. 3, the same results for window sizes 50% larger, and lower accuracy for window sizes 50% smaller. While the HOT SAX method has better RankPower results for the smaller and larger window sizes, WLOF obtains the best RankPower (5.12) for the window sizes of Sect. 3, better than any result of the other methods at any of the window sizes considered.

Table 16 Experimental results of different window sizes and methods

With respect to computational complexity, we compare the WLOF and HOT SAX methods. Suppose n is the size of the data set. Keogh et al. (2005) pointed out that the complexity of their method is O(n2), although they proposed heuristics to reduce it (Keogh et al. 2005), and they later presented an algorithm that finds discords exactly in O(n) time, with “two linear scans through the database and a limited amount of memory based computation” (Yankov et al. 2007). The WLOF has the same complexity as the LOF, which differs from that of HOT SAX. Breunig et al. (2000) analyzed the complexity of LOF; the complexity of WLOF and LOF is as follows:

$$ T(n) = O(n*t_{k} ) $$
(14)

where \( t_{k} \) is the time for a k-nearest-neighbour search.

For low-dimensional data, the complexity is O(n); for medium- to moderately high-dimensional data, it is O(n log n); for extremely high-dimensional data, it is O(n2).

With respect to the effect of the weighted local outlier factor, Fig. 16 shows the important points selected for the ECG data set chfdb_chf13_45590, with the parameters given in Sect. 3.2. The symbol '*' represents extreme points and 'o' the additional important points computed by Formula (2). As Fig. 16 shows, the selected important points segment the time series data, which helps obtain the four features defined. Table 17 shows the four feature values for the first seven subsequences of chfdb_chf13_45590; we find Sum1 = 647, Sum2 = 77,224, Sum3 = 3915 and Sum4 = 2569, obtained as described after Formula (9). Notice that for the feature 'number of important points in the subsequence', Sum2 is much larger than the sums of the other features; it would therefore dominate the results of the LOF method using the four features as input, which is why that method is unable to find the anomaly in chfdb_chf13_45590 near data point 2700, shown in Fig. 6. Our WLOF method instead assigns different weights to different features, using the sum of the values of each feature to ensure that appropriate weights are used, as given in Formula (9). As Table 14 shows, for chfdb_chf13_45590 the WLOF method finds the anomaly at rank 1. In summary, the WLOF makes use of all the features in anomaly detection.

Fig. 16 The selected important points of time series ECG chfdb_chf13_45590

Table 17 The four feature values for the first 7 subsequences of chfdb_chf13_45590

To investigate the discriminability of the four features, we carried out further experiments on combinations of these features, analyzing the effect of every combination of three features. Table 18 shows the results. Features 2, 3 and 4 obtain the best results, with 100% accuracy and a RankPower of 3.28. However, as shown in Table 14, adding feature 1 yields a higher RankPower of 5.12. Combining features 1, 2 and 4 gives 0% accuracy, which indicates that feature 3, 'average of the subsequence', plays a very important role in anomaly detection. Therefore, using the four features together can effectively identify the anomalies, but including more features does not necessarily improve the results: in the case of LOF (vector), where the vector of all values of the original subsequence is used as the input features for the LOF method, the accuracy is low, as shown in Table 14.

Table 18 How combinations of three features affect the results

This section has discussed the experimental results of our WLOF method in comparison with the LOF, NLOF, LOF (vector) and HOT SAX methods. The experiments show that the new features work better than LOF (vector), and that our weighting method works better than the normalization method, as shown in Tables 14 and 16. The effect of the proposed new features is presented in Table 18, and the assessment of different values of the parameter β in Table 15, with the results demonstrating that the WLOF method obtains better accuracy than LOF (vector) for different β values. Across all the experiments, our important points, features and weighting method obtain better accuracy and RankPower.

5 Conclusion and future work

In this paper, we have proposed a new WLOF method with three novel components. The PLR_IP component, which builds on extreme points and additional points, can effectively fit the original time series for appropriate values of the parameter β. The four features, three of which are defined on the basis of the PLR_IP method, represent different aspects of time series data and serve as the input to the WLOF method. Finally, the weighting scheme, which assigns the four features different weights, makes effective use of the discriminative power of all the features together. These components effectively characterize time series data and underpin the WLOF, with the experiments over the seventeen datasets illustrating their effectiveness in anomaly detection.

The comparison between our weighting method and the normalization method demonstrates that the PLR_IP method can effectively extract the features of time series and assist the WLOF method in detecting anomalies in time series data. The experimental results also show that the WLOF method obtains better results over the 17 data sets than the LOF, NLOF and LOF (vector) methods, and results comparable with HOT SAX. These results indicate that our feature extraction method improves the anomaly detection performance of the LOF method, and that our weighting method is better than the normalization method.

One particular issue with the proposed approach is that a number of parameters need to be set prior to its application to anomaly detection. To overcome this shortcoming in practice, we plan to conduct further studies in line with the current research, including (1) investigating other features for anomaly detection, for example the geometrical information of data points; (2) devising a new weighting method that captures the relationships among all features; and (3) revising the WLOF model to reduce the number of parameters required.