Abstract
The increasing amount of available unlabeled time series data renders the semi-supervised paradigm a suitable approach to tackle classification problems with a reduced quantity of labeled data. Self-labeled techniques stand out among semi-supervised classification methods due to their simplicity and their lack of strong assumptions about the distribution of the labeled and unlabeled data. This paper addresses the relevance of these techniques in the time series classification context by means of an empirical study that compares successful self-labeled methods in conjunction with various learning schemes and dissimilarity measures. Our experiments involve 35 time series datasets with different ratios of labeled data, aiming to measure the transductive and inductive classification capabilities of the self-labeled methods studied. The results show that the nearest-neighbor rule is a robust choice for the base classifier. In addition, the amending and multi-classifier self-labeled approaches prove a promising way to perform semi-supervised classification in the time series context.
1 Introduction
In the time series field, the semi-supervised learning (SSL [12]) paradigm has received a lot of research attention during the past decade. As cheap sensors of all kinds become more and more available, vast amounts of unlabeled time series data are generated. By contrast, the cost of the labeling process often makes it unfeasible to obtain a fully labeled training set. In this situation, SSL is a good solution for improving learning accuracy by taking advantage of the unlabeled data in conjunction with a small set of labeled data. Specifically, semi-supervised classification (SSC) focuses on training a classifier such that it outperforms a supervised classifier trained on the labeled data alone. In the time series domain, SSL has found a wide range of applications such as the classification of flying insects [5], web information extraction [23] and failure prediction in oil production [40]. In this work, we tackle SSC problems in the time series classification context.
Time series data are characterized by high dimensionality and a numerical, continuous nature. Therefore, special treatment is required to deal with time series classification [25]. A first category of proposals, called the feature-based approach [7, 11, 17, 26, 62], transforms the original time series into a new description space where conventional classifiers can be applied. Signal processing or statistical tools are commonly used to extract features from the original raw data. A second category [21, 34, 46, 52, 53, 64] focuses on customizing or developing classifiers specifically designed for time series data. This category, which includes the instance-based approach, is mostly based on the selection of an appropriate representation of the time series and a suitable measure of dissimilarity. Several representations and dissimilarity measures have been proposed to deal with the time series classification problem, including spectral approaches [22], the autocorrelation function [2] and elastic measures [13, 41, 54]. Our paper considers this second approach.
Several SSC approaches have been applied to the time series classification problem. The work of Marussy and Buza [43] uses the cluster-then-label approach by a constrained hierarchical single link clustering method. The work of Frank et al. [24] applies a similar approach with a similarity measure called geometric template matching. The applicability of graph-based methods to time series classification is addressed in various works [18, 19]. The classical semi-supervised support vector machines method is extended to tackle time series classification by Kim [35].
Another family of SSC methods, denoted self-labeled techniques [57], aims to enlarge the original labeled set using the most confident predictions to classify unlabeled data. In contrast to the previously mentioned approaches, self-labeled techniques do not make any special assumptions about the distribution of the input data. Self-training [67] and co-training [9] are the most popular self-labeled techniques. Both approaches have been applied in a time series context.
Self-training is a wrapper method that iteratively classifies the unlabeled examples, assuming that its most confident predictions tend to be correct. In the time series domain, the self-training technique is mostly applied to a particular case of SSC, called positive unlabeled learning, in which only examples from one class are available [6, 14, 30, 51, 61]. In conjunction with self-training, the k-nearest-neighbor classifier (kNN [1]) is typically used as the base learner, as it has been shown to be particularly effective for time series classification tasks [55, 58].
Co-training is a SSC method that requires two conditionally independent views which are sufficient for learning and classification. For each view, the unlabeled instances with highest confidence are selected and labeled to be turned into additional training data for the other view. The multi-view requirement of the co-training technique is typically too strong and difficult to meet in the time series domain where in most cases observations close together are correlated. The work of Meng et al. [44] applies a variant of co-training [29] which uses the hidden Markov model [49] and one-nearest-neighbor (1NN) as two different learners instead of the classical two views of the data.
There are various works in the literature [6, 14, 44, 51, 61] that focus on the SSC of time series and involve self-labeled techniques. To the best of our knowledge, self-training and co-training are the only self-labeled techniques that have been applied in a time series context so far. Our study broadens this approach to other self-labeled techniques, thereby gaining new insights and allowing for more detailed conclusions on this topic. Moreover, despite the successful application of 1NN to time series classification tasks, the use of different learning approaches as base classifiers seems to be an under-explored area. These reasons motivate our paper, which pursues three main objectives:
-
To explore the applicability of classical self-labeled techniques in the time series domain, as well as the use of other classification schemes as base learners in addition to the well-known 1NN.
-
To identify the best methods for each base learner under different ratios of labeled data and dissimilarity measures.
-
To determine the influence of the geometrical characteristics of time series datasets in the performance of the self-labeled techniques.
The rest of the paper is organized as follows. In Sect. 2, we provide definitions and the notation of SSC in the time series domain. Furthermore, we discuss the main characteristics of the self-labeled techniques and the base learners involved in this study. In Sect. 3, we introduce the experimental framework. In Sect. 4, we present the results obtained and discuss them. Finally, Sect. 5 concludes the paper.
2 Semi-supervised time series classification
In this section, we define the problem of SSC of time series and introduce the principal notation and definitions. Furthermore, we review the self-labeled techniques and the base learner methods involved in this study (Sects. 2.1, 2.2).
In SSC, the source dataset has two parts, L and U. Let L be the set of instances \(\{{x}_{1},{\dots },{x}_{l}\}\) for which the labels \(\{{y}_{1},{\dots },{y}_{l}\}\) are provided. Let U be the set of instances \(\{{x}_{l+1},{\dots },{x}_{l+u}\}\) for which the labels are not known. We follow the typical assumption that there is much more unlabeled than labeled data, i.e., \(u \gg l\). The whole set \(L \cup U\) forms the training set.
Throughout this study, each instance \({x}_{i}, i=1,\dots ,l+u,\) represents a univariate, real-valued and evenly spaced time series. In this case, the time series \(x_i=[p_1,p_2,\dots ,p_n]\) is considered a sequence of 1-dimensional data points.
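To make the notation concrete, the following sketch builds a small semi-supervised training set; the data are hypothetical (randomly generated), not one of the benchmark sets used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each instance is a univariate, evenly spaced time series of length n.
n, l, u = 24, 5, 45                       # typical SSC setting: u >> l
series = rng.standard_normal((l + u, n))  # x_1, ..., x_{l+u}

L_X, L_y = series[:l], np.array([0, 1, 0, 1, 1])  # labeled set L
U_X = series[l:]                                  # unlabeled set U (labels withheld)

training_set = np.vstack([L_X, U_X])  # L ∪ U forms the whole training set
```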
Depending on the goal, we can categorize SSC into two slightly different settings [12], namely transductive and inductive learning. The former is devoted to predicting the labels of the unlabeled instances U provided during the training phase. The latter aims to predict the labels of unseen testing instances. In this work, we delve into both settings, aiming to perform an extensive analysis of the selected methods.
2.1 Self-labeled techniques
Self-labeled techniques follow a wrapper methodology using one or more supervised classifiers to determine the most likely class of the unlabeled instances during an iterative process. The base classifier(s) play(s) a crucial role in the estimation of the most confident instances of U. The main feature that distinguishes self-labeled methods is the way they obtain one or several enlarged labeled sets to efficiently represent the learned hypothesis from the training set. In the literature, there are several proposals that follow this approach which differ mainly in the following aspects:
-
Addition mechanism There is a variety of schemes in which the enlarged set can be formed. The most used ones are incremental and amending. The former adds, step-by-step, the most confident instances from U to the enlarged set. The latter allows rectifications to already labeled instances to avoid the introduction of noisy instances in the enlarged set(s).
-
Classification model Self-labeled techniques can use one or more base classifiers to establish the class of unlabeled instances. Single-classifier models assign to each instance the most likely class considering the used classifier. Multi-classifier models combine the hypothesis learned by several classifiers to estimate the class by agreement of classifiers or combination of the probabilities obtained by single-classifiers.
-
Learning approach Independently of the number of base classifiers, the number of learning methods is another important issue. The single learning approach can be linked with single- and multi-classifier models. By contrast, the multi-learning approach is intrinsically related to the multi-classifier model. In a multi-learning method, the different classifiers come from different learning methods.
-
Stopping criterion This is the mechanism used to stop the self-labeling process, preventing the addition to L of instances labeled with a low confidence level. Often, a prefixed number of iterations is used as the stopping criterion. Another common criterion is the absence of changes in the learned hypotheses during successive iterations of the self-training process.
Since each approach has its own benefits and drawbacks, we include in this study a representative sample of methods. The selection made is based on the results obtained in the extensive overview study of Triguero et al. [57], and it includes the following methods:
-
Standard self-training (SelfT [67]): a single-classifier, single learning method that extends the set L with the most confident examples extracted and classified from U during an iterative, incremental process. The stopping criterion consists of a fixed number of iterations that can be adapted to the original size of U. Following a wrapper methodology, the base classifier used by self-training is considered as another parameter of the method.
-
Self-training with editing (SETRED [38]): a self-training variant with a different addition mechanism. SETRED introduces a data editing method to filter out the noisy examples that have been labeled by the base classifier. In each iteration, the mislabeled examples are identified using the local information provided by the neighborhood graph [71].
-
Self-training nearest-neighbor rule using cut edges (SNNRCE [59]): a variant of SETRED that includes a first stage in which the most confident examples are added to L. In the next stage, standard self-training is applied in combination with the 1NN rule as a base classifier. The iterative process stops when the expected number of examples in the minority class is reached, according to the distribution observed in L. In the final stage, the mislabeled examples are relabeled according to the information provided by the neighborhood graph.
-
Tri-training (TriT [70]): a variant of co-training that trains three instead of the traditional two classifiers. These classifiers share the same learning scheme. The diversity between the base classifiers is obtained by manipulating the original set L, for example, using Bagging [10]. In each iteration, the selected examples from U are labeled and added to the training set of a specific classifier only if the remaining classifiers agree and some further conditions are satisfied. The stopping criterion is reached when the hypotheses of the three classifiers undergo no modification during a complete iteration.
-
Democratic co-learning (Democratic [69]): a multi-classifier, multi-learning method. The specific number of classifiers and their learning schemes are established as arguments of the method. Initially, all classifiers are trained using the examples in L. In each iteration, a label for each unlabeled example is proposed via majority vote. If, for a particular example, the classification provided by a classifier disagrees with the majority, then this example is included in the training set of that classifier. The iterative process stops when the training sets of the classifiers receive no additions during a complete iteration.
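As an illustration of the incremental addition mechanism shared by these methods, the sketch below implements a bare-bones self-training loop with the 1NN rule under Euclidean distance. The confidence heuristic (distance to the nearest labeled neighbor) and the parameters `n_iter` and `per_iter` are simplifications for illustration, not the exact procedures of the cited methods.

```python
import numpy as np

def nn1_predict(L_X, L_y, X):
    """1NN prediction under Euclidean distance; also return the
    distance to the nearest labeled neighbor as a confidence proxy."""
    d = np.linalg.norm(X[:, None, :] - L_X[None, :, :], axis=2)
    return L_y[d.argmin(axis=1)], d.min(axis=1)

def self_training(L_X, L_y, U_X, n_iter=10, per_iter=2):
    """Incremental self-training: in each iteration, label the most
    confident unlabeled instances and move them into L."""
    L_X, L_y, U_X = L_X.copy(), L_y.copy(), U_X.copy()
    for _ in range(n_iter):
        if len(U_X) == 0:
            break                               # U exhausted
        pred, dist = nn1_predict(L_X, L_y, U_X)
        keep = np.argsort(dist)[:per_iter]      # closest = most confident
        L_X = np.vstack([L_X, U_X[keep]])
        L_y = np.concatenate([L_y, pred[keep]])
        U_X = np.delete(U_X, keep, axis=0)
    return L_X, L_y                             # enlarged labeled set
```

The enlarged set returned by the loop is then used to train the final classifier, as in the wrapper methodology described above.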
Table 1 summarizes the principal characteristics of the self-labeled methods selected.
The variety of stopping criteria associated with the self-labeled methods makes it difficult to estimate the maximum number of iterations performed by each method. For simplicity, we focus the temporal analysis on the complexity of executing each iteration of the method's main loop. The algorithmic complexity is expressed in terms of the current number of unlabeled examples (u) and labeled examples (l) at the beginning of the iteration. Table 1 includes the time analysis of each method. The functions \(T_\mathrm{t}\) and \(T_\mathrm{c}\) represent the time cost associated with a specific learning scheme for training the model and classifying new instances, respectively. \(l'\) is the number of candidate examples selected to increase the set L, and \(l''\) is the resulting number of examples after the filtering process. In the case of the SETRED method, the construction of the neighborhood graph has cubic complexity. For the analysis of the Democratic method, we take into account the existence of N different learning schemes.
2.2 Supervised approaches for time series classification
Different approaches have been used to face the time series classification problem such as kNN classifiers, decision trees (DT) or support vector machines (SVMs).
The kNN classifier has been widely applied in the time series context [28, 45]. This classifier approximates the confidence in terms of dissimilarity between instances. There are several distance measures presented in the literature that have been used for evaluating dissimilarity between time series: lock-step measures (Euclidean), feature-based measures (Fourier coefficients), model-based measures (autocorrelation functions), and elastic measures.
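The 1NN rule with a pluggable dissimilarity can be sketched as follows; Euclidean distance is the default here, and any of the measures listed above could be passed as `dist`:

```python
import numpy as np

def nn1(train, labels, query, dist=None):
    """Classify a query series by the label of its nearest training
    series under the chosen dissimilarity measure."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))  # lock-step Euclidean
    d = [dist(query, x) for x in train]
    return labels[int(np.argmin(d))]
```

Switching from a lock-step to an elastic or model-based measure only requires supplying a different `dist` callable.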
The construction of DTs is another approach applied to time series classification. Yamada et al. [66] propose two binary split tests called the standard-example and the cluster-example. The former selects an existing instance as the standard time series, and the members of the child nodes are selected depending on their distances to this selected instance. The latter split searches for two standard time series to bisect the set of instances. A similar idea is followed by Balakrishnan and Madigan [4] with a clustering-goodness criterion that searches for the pair of time series that best bisects the set of instances. In both works, the dynamic time warping (DTW [54]) distance is used. A new split criterion based on an adaptive metric that covers both behavior and value proximities is developed by Douzal-Chouakria and Amblard [21].
On the other hand, SVMs are another popular technique that has been applied to time series classification. The work of Pree et al. [47] compares several similarity measures used as the kernel function in SVMs. In contrast to other learning approaches, the performance of the SVM constructed with Euclidean distance significantly outperforms that obtained using the DTW distance. The reason for this behavior has been analyzed in multiple works [37, 68]: it is caused by the indefiniteness of the kernel constructed with DTW. The use of classical recursive elastic distances to construct recursive edit distance kernels is addressed by Marteau and Gibet [42]. The kernels constructed in this way are positive definite if certain sufficient conditions are satisfied. Moreover, the construction of a weighted DTW kernel to classify time series data is proposed by Jeong and Jayaraman [33].
In this study, we use as base learners three classifiers representative of these classification approaches, namely kNN, DT and SVM.
3 Experimental framework
This section presents the information related to the datasets involved in the study in Sect. 3.1. The performance measures and the configuration parameters of the algorithms used are addressed in Sects. 3.2 and 3.3, respectively.
3.1 Datasets
The experimentation is based on 35 standard classification datasets taken from publicly available repositories [15, 60]. Table 2 summarizes the main properties of the selected datasets. The datasets involved in this study contain between 56 and 9236 instances, the time series length ranges from 24 to 1882, and the number of classes varies between 2 and 14. For each dataset, the time series are z-normalized, following the recommendation of Rakthanmanon et al. [50].
The datasets are randomly divided using a fivefold cross-validation procedure. Each training partition (4/5 of the total set of examples) is randomly divided into two sets: L and U, of labeled and unlabeled (i.e., the labels are withheld and not available to the algorithm) examples, respectively. Following the approach of Triguero et al. [57] and Wang et al. [59], we do not attempt to keep the class proportion in the L and U sets the same as in the whole training set. The class label of the instances selected to form the set U is removed. We make sure that every class has been represented in L.
With the purpose of studying the influence of the amount of labeled data, we take different ratios when dividing the training set. In our experiments, three ratios are used: 10, 20 and 30%. For instance, assuming a training partition which contains 100 examples, when the labeled rate is 10%, 10 examples are put into L with their labels while the remaining 90 examples are put into U without their labels. Note that the test partition (25 examples) is kept aside to evaluate the inductive performance of the learned hypothesis.
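One possible reading of this splitting procedure can be sketched as follows; the class-coverage guarantee is implemented here by seeding L with one instance per class, which is an assumption, since the exact mechanism is not detailed above.

```python
import numpy as np

def split_L_U(y, labeled_ratio, rng):
    """Split training indices into labeled (L) and unlabeled (U) parts,
    making sure every class appears at least once in L."""
    y = np.asarray(y)
    idx = rng.permutation(len(y))
    n_labeled = max(int(round(labeled_ratio * len(y))), len(np.unique(y)))
    # Seed L with one instance per class, then fill it up at random.
    L = [next(i for i in idx if y[i] == c) for c in np.unique(y)]
    rest = [i for i in idx if i not in L]
    L += rest[: n_labeled - len(L)]
    U = [i for i in idx if i not in L]
    return np.array(L), np.array(U)
```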
3.2 Performance measures
With the aim of measuring the effectiveness of the classification performed by the SSC techniques, we use two classical statistics: accuracy rate [63] and Cohen’s kappa rate [8]. The two measures are briefly explained as follows:
-
Accuracy: This measure reflects the agreement between the observed and predicted classes. It is a simple metric commonly employed for assessing the performance of classifiers.
-
Cohen’s kappa: This measure takes into account the successful hits that would be generated simply by chance. Cohen’s kappa ranges from −1 to 1, where a value of 0 means there is no agreement, a value of 1 indicates total agreement, and a negative value indicates that the prediction is in the opposite direction.
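Both measures follow directly from their standard definitions; the sketch below computes them from the observed and predicted labels:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Proportion of predictions that agree with the observed classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def cohen_kappa(y_true, y_pred):
    """Kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o corrected
    by the agreement p_e expected from the marginal class frequencies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return float((p_o - p_e) / (1 - p_e))
```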
3.3 Algorithms used and parameters
In this section, we specify the configuration parameters of all the methods involved in this study. The selected values are uniformly used in all the datasets, and they were mostly selected taking into account the recommendations offered in previous works. The parameters are not optimized for each specific dataset because the main purpose of this experimental study is to compare the general performance of the self-labeled techniques. The configuration parameters are shown in Table 3.
Some of the self-labeled methods have their own built-in stopping criteria, which we use accordingly for these methods. For methods that do not have a predefined stopping criterion, we define one as follows. For each dataset, the self-labeled process stops when the first of the following criteria is satisfied: (i) 70% of the unlabeled instances in the initial set U have been removed and inserted into L, or (ii) the algorithm has reached a maximum number of 50 iterations. Here, InstPerIter is the number of instances removed from U in each iteration. The proposed stopping criterion facilitates the exploitation of U and avoids the extreme behavior caused by feeding all the unlabeled instances from U to the base learner.
Most of the self-labeled methods include one or more base classifier(s). For those methods that support base classifiers from different approaches (SelfT and TriT), we explore all the possible combinations. In this study, we select as base classifiers three methods that represent influential approaches among time series classification algorithms: kNN, DT and SVM.
1NN is a widely used classifier in the time series classification domain. Multiple studies [28, 36, 55, 65] related to time series similarity measures are based on this classifier.
The method proposed in Douzal-Chouakria and Amblard [21] is selected to construct DT specifically designed to classify time series data. This method obtains competitive results, and its split procedure is flexible enough to cover behavior and value proximities. The cost function \(c_b\) to evaluate the proximity between two series is evaluated as \(c_b(r)=2/(1+\exp (b \cdot \mathrm{Co}(r))) \cdot c(r)\), where r is a mapping between two series, \(\mathrm{Co}\) is the behavior-based cost function, c is the values-based cost function, and the parameter b modulates the influence of the behavior in the overall cost. This parameter has been empirically fixed to 2 (Table 3). As \(\mathrm{Co}\) function, we have used a variant of Pearson correlation involving first-order differences, and as c function, we have explored several time series measures.
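The combined cost can be sketched as follows, with `Co` and `c` left as placeholder callables standing in for the behavior-based and value-based costs of the cited method:

```python
import math

def combined_cost(r, Co, c, b=2.0):
    """Adaptive cost c_b(r) = 2 / (1 + exp(b * Co(r))) * c(r):
    the parameter b modulates how strongly the behavior-based cost Co
    tempers the value-based cost c for a mapping r between two series."""
    return 2.0 / (1.0 + math.exp(b * Co(r))) * c(r)
```

Note that when `Co(r)` is zero (or `b` is zero), the sigmoid factor equals 1 and the combined cost reduces to the value-based cost alone.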
The kernel function selected to construct the SVMs is the Gaussian radial basis function (RBF), i.e., \(K_{d}(x_i,x_j)=\exp ({-d(x_i,x_j)^2/(2\sigma ^2)})\). Most previous studies [42, 47, 68] use it in combination with a distance measure d selected from the time series domain. Following the methodology of Marteau and Gibet [42], we normalize the pairwise distance matrix in the training stage to limit the search space of the parameters. Specifically, we use a predefined set of C and \(\sigma \) values and select the most appropriate one during a cross-validation process. Additionally, other kernels were evaluated [16, 56], but they resulted in an unfeasible computational cost in the self-labeled context.
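A distance-substitution RBF kernel of this form might be sketched as follows; as the works cited above note, positive definiteness is only guaranteed for certain choices of the distance d:

```python
import math

def rbf_distance_kernel(x, y, d, sigma=1.0):
    """K_d(x, y) = exp(-d(x, y)^2 / (2 * sigma^2)) for a generic
    dissimilarity d (Euclidean yields the classical RBF kernel)."""
    return math.exp(-d(x, y) ** 2 / (2.0 * sigma ** 2))
```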
Throughout the experimentation, we evaluate five different measures to compute the dissimilarity between instances: Euclidean, DTW, ERP, ACF and FFT. The Sakoe–Chiba band global constraint [54] is used to improve the performance of the elastic measures. Specifically, we fix the window size to 4 and 9% of the time series length for DTW and ERP, respectively, following the recommendation in Kurbalija et al. [36].
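A minimal dynamic-programming DTW with the Sakoe–Chiba band might look like this; the half-width `w` is given here in absolute steps, whereas the experiments above express it as a percentage of the series length:

```python
import math

def dtw(a, b, w=None):
    """DTW distance with an optional Sakoe-Chiba band of half-width w.
    Without w, the full warping matrix is explored."""
    n, m = len(a), len(b)
    w = max(w if w is not None else max(n, m), abs(n - m))
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells within the band |i - j| <= w are visited.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])
```

Besides the accuracy gain reported in Kurbalija et al. [36], the band reduces the per-pair cost from quadratic to roughly linear in the series length for a fixed window.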
4 Results and discussion
This section presents the results obtained in the experiments and a detailed discussion of those. We evaluate the performance of the methods in two different settings: results obtained in transductive learning (Sect. 4.1) and inductive learning (Sect. 4.2), under three different ratios of labeled data. Section 4.3 presents an empirical analysis of the run-times obtained by the self-labeled techniques. Section 4.4 addresses the geometrical characteristics of the time series datasets and its influence in the performance of the techniques studied. A comparison between the supervised and semi-supervised learning paradigm is presented in Sect. 4.5. Finally, the discussion of all results is performed in Sect. 4.6.
We use nonparametric statistical tests to contrast the results obtained, following the methodology proposed in García et al. [27]. Concretely, we use the Aligned Friedman test [32] for multiple comparisons to detect statistically significant differences between the evaluated methods and the post-hoc procedure of Hochberg [31] to characterize those differences. In comparisons with only two algorithms involved, we use the Wilcoxon signed rank test, following the recommendation in Demšar [20].
4.1 Transductive results
As stated in Sect. 2.1, the main goal of transductive classification is to predict the class of the unlabeled data used during the training phase. Table 4 presents the average accuracy (Acc) results of the self-labeled methods involved in this study over the 35 datasets with 10, 20 and 30% of labeled data. The methods are presented in descending order of the accuracy. For those methods that support different base classifiers, we have explored all possible combinations specifying the classifier in the method’s name.
The difference between this ranking and the ranking obtained when we sort using the kappa results is denoted as \(\Delta \mathbf K \): positive values indicate that the method obtains a better position ranking by kappa, negative values indicate the opposite, and zero means no difference. \(\Delta K\) shows whether or not a certain method benefits from random hits. Table 4 reveals no significant difference between the two orders as only a handful of methods exhibit \(\Delta K\) values different from zero.
Figure 1 shows box and whisker plots of the methods under the dissimilarity measures studied. This illustration allows us to visualize in more detail the performance of the self-labeled methods. It shows the gain of accuracy caused by the use of DTW and ERP in comparison with the other measures. The superiority of DTW over Euclidean distance has been addressed in previous studies about time series classification problem. For instance, the extensive study performed by Serrà and Arcos [55] supports this conclusion. Furthermore, the study of Wang et al. [58] reveals that DTW and ERP are clearly superior to Euclidean distance. In this sense, our results confirm the advantage of these elastic measures in the semi-supervised context.
On the other hand, the methods combined with 1NN as a base classifier exhibit the best performance. By contrast, the lowest results are obtained in combination with DT. Moreover, the use of SVM as a base classifier causes a wide spread of the accuracy values. Figure 2 presents the average results in a bar plot, aiming to compare the accuracy values across different labeled ratios.
We perform a comparison of the accuracy among all single learning methods grouped by their learning scheme. This comparison allows us to determine the most successful self-labeled methods for each base classifier.
The Aligned Friedman test, applied to the accuracy of all methods that use 1NN as a base classifier, detects significant differences at a significance level of \(\alpha = 0.05\) for all comparisons performed. Table 5 shows the rankings obtained. The most accurate method is chosen as the control method for the application of the post-hoc procedure. For the Euclidean and FFT distances, the method selected is SETRED for all labeled ratios. For DTW, ACF and ERP, the method selected is TriT in most of the comparisons. SNNRCE and SelfT show the lowest accuracy values. In the majority of comparisons, these methods are significantly outperformed by the control method, following the Hochberg post-hoc procedure.
Table 6 shows the application of the Wilcoxon signed ranks test to contrast the accuracy of the methods that use DT and SVM as base classifiers. For all dissimilarity measures and labeled ratios, TriT outperforms SelfT using both base classifiers. The difference obtained is significant at a significance level of \(\alpha = 0.05\), with the exception of ERP at 10% of labeled data.
Finally, Table 7 offers a comparison between the most competitive methods from single learning and the multi-learning approach (Democratic). We consider as “competitive” those methods that have not been significantly outperformed more than twice in the 15 comparisons performed (five distance measures \(\times \) three labeled ratios) in Tables 5 and 6. Following this criterion, the outstanding methods selected are: SETRED and TriT.
The Aligned Friedman test, applied to accuracy, detects significant differences at a significance level of \(\alpha = 0.05\) for all comparisons performed. For the dissimilarity measures DTW and ACF, the control method selected is TriT-1NN in most cases. For the dissimilarity measures Euclidean, FFT and ERP, Democratic is selected as the control method in most of the comparisons. SETRED exhibits competitive behavior, as it is not outperformed by the control method in any of the comparisons. By contrast, TriT-SVM and TriT-DT are significantly outperformed by the control method in most of the comparisons performed, with the exception of ERP, where TriT-SVM exhibits better behavior.
4.2 Inductive results
The main target of inductive learning is to classify instances not included in the training phase. This analysis is useful to test the previously learned hypotheses and their generalization abilities. Table 8 shows the average accuracy obtained using all dissimilarity measures studied. Figure 4 shows box and whisker plots of the same results grouped by ratios. Figure 3 shows a bar plot of the average accuracy, reflecting the improvement obtained by increasing the amount of labeled examples.
Once more, we perform a comparison of the accuracy among all single learning methods grouped by their learning scheme. The aligned Friedman test, applied to the accuracy of all methods that use 1NN as base learning scheme, detects significant differences for all comparisons performed. Table 9 shows the rankings obtained. The control method selected is SETRED in most of the comparisons except for ACF where TriT is selected in two labeled ratios. In general, SNNRCE and SelfT show the lowest values of accuracy and are significantly outperformed by the control method in most of the comparisons following the Hochberg post-hoc procedure.
Table 10 shows the application of the Wilcoxon signed ranks test to the accuracy of the methods that use DT and SVM as base classifiers. For all dissimilarity measures and labeled ratios, TriT significantly outperforms SelfT using both base classifiers, with the exception of SVM under ERP at 10% of labeled data.
Finally, Table 11 offers a comparison between the most competitive methods from single learning and the multi-learning approach. Once more, the outstanding methods selected are: SETRED and TriT. The aligned Friedman test, applied to accuracy, detects significant differences for all comparisons performed. For the dissimilarity measures Euclidean, FFT and ERP, the control method selected is Democratic in all comparisons. For ACF, TriT-1NN is selected as control method in most of the comparisons. SETRED exhibits the best behavior under the dissimilarity measure DTW. TriT-SVM and TriT-DT are significantly outperformed in most of the comparisons by the control method, with the exception of FFT and ERP where TriT-SVM exhibits a competitive behavior.
4.3 Experimental run-times
From the temporal analysis performed in Sect. 2.1, it is clear that the main source of temporal complexity is related to the successive operations of training the model(s) and classifying instances. The cost associated with these operations directly depends on the learning scheme(s) used. In this section, we present an empirical analysis of the run-times based on a sample of 20 datasets included in the experimentation. All experiments were performed on a cluster composed of 46 nodes, each equipped with an Intel\(^{\circledR }\) Core\(^{TM }\) i7-930 processor at 2.8 GHz and 24 GB of RAM. Under GNU/Linux, we ran the experiments using R version 3.3.1 [48].
During the training and classification process, we provide the time series datasets to the self-labeled techniques in the form of precomputed distance matrices. This avoids repeated distance calculations and therefore reduces the running time. For that reason, the time spent computing the distance matrices is not considered in the current analysis.
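The precomputation idea can be sketched as follows: all pairwise distances are computed once, and every subsequent 1NN query during the self-labeling iterations is a cheap lookup into the matrix. This is an illustrative Python sketch with Euclidean distance and toy series; function names are our own, and the paper's actual implementation is in R.

```python
import numpy as np

def distance_matrix(series):
    """Pairwise Euclidean distances between equal-length series,
    computed once up front."""
    X = np.asarray(series, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def one_nn_from_matrix(D, labeled_idx, labels, query_idx):
    """1NN classification that reads distances from the precomputed
    matrix, so self-labeling iterations never recompute a distance."""
    d = D[query_idx, labeled_idx]
    return labels[int(np.argmin(d))]

# toy series: two "rising" and two "falling" shapes
series = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [5.0, 4.0, 3.0], [5.1, 4.2, 3.1]]
D = distance_matrix(series)
# instances 0 and 2 are labeled; classify unlabeled instance 1
pred = one_nn_from_matrix(D, labeled_idx=[0, 2], labels=["up", "down"],
                          query_idx=1)
```

Since self-labeled techniques repeatedly reclassify the same unlabeled pool as L grows, amortizing the distance computation this way dominates the saved run-time.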
Table 12 contains the run-times of the self-labeled techniques using a sequential execution of the fivefold cross-validation procedure. TriT-1NN obtains the lowest run-times in all cases. The 1NN base classifier produces the shortest run-times, whereas SVM produces the longest ones. This is caused by the time consumed in constructing the SVM, including the threefold cross-validation process to adjust the parameters C and \(\sigma \). Democratic incurs an expensive run-time because it trains a classifier for each learning scheme. SETRED and SNNRCE obtain competitive results, as the base classifier trained is 1NN.
4.4 Impact of the geometrical characteristics of datasets
The geometrical characteristics of the datasets affect the performance of the self-labeled methods. The overlapping of samples is a common source of difficulty in the classification process. In addition, the reduced labeled sample in the SSC framework and the high dimensionality of time series data introduce another layer of complexity. In Wang et al. [59], the overlapping of samples from different classes is investigated in order to explain the performance degradation experienced by SSC algorithms on datasets that suffer from this problem.
We follow a similar idea by computing the neighbors of each training instance (from U) in a neighborhood graph constructed from \(L \cup U\). The proportion of neighbors (from L) with a class different from that of the training instance analyzed is used as an overlap measure. Table 13 includes this overlap measure averaged over all training examples of each dataset. The neighborhood graph was computed for each dissimilarity measure and proportion of labeled data. Datasets with high proportions of neighbors of a different class correspond to high levels of overlap between classes.
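A measure in this spirit can be sketched as follows, assuming Euclidean distance and a k-nearest-neighbor graph (the names and the choice of k are illustrative, not the paper's exact construction): for each unlabeled training instance, take its k nearest labeled neighbors and record the fraction whose class differs from the instance's ground-truth class.

```python
import numpy as np

def overlap_measure(X, y, labeled_mask, k=3):
    """Average proportion of labeled neighbors whose class differs from
    the (ground-truth) class of each unlabeled training instance."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(labeled_mask, dtype=bool)
    labeled_idx = np.flatnonzero(mask)
    props = []
    for i in np.flatnonzero(~mask):                      # instances from U
        d = np.linalg.norm(X[labeled_idx] - X[i], axis=1)
        nn = labeled_idx[np.argsort(d)[:k]]              # k nearest labeled
        props.append(np.mean([y[j] != y[i] for j in nn]))
    return float(np.mean(props))

# toy data: two well-separated classes, hence overlap close to zero
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]]
y = ["a", "a", "a", "b", "b", "b"]
labeled = [True, True, False, True, True, False]
score = overlap_measure(X, y, labeled, k=2)
```

Values near 0 indicate well-separated classes, while values near 1 indicate that unlabeled instances sit among labeled instances of other classes.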
In order to investigate the impact of overlapping on classification performance, we correlate the values of Table 13 with the accuracy obtained by the self-labeled techniques. Figure 5 shows the correlations between the two variables studied (accuracy and overlapping). For all dissimilarity measures and proportions of labeled data studied, the two variables present a strong inverse correlation. This means that the presence of overlapping in datasets significantly affects the performance of the self-labeled techniques.
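The correlation analysis itself is a standard per-dataset computation; a minimal sketch with hypothetical overlap scores and accuracies (not the values from Table 13) could look like this:

```python
from scipy.stats import pearsonr

# hypothetical per-dataset overlap scores and self-labeled accuracies
overlap  = [0.05, 0.10, 0.22, 0.35, 0.48, 0.60]
accuracy = [0.94, 0.90, 0.81, 0.72, 0.65, 0.55]

r, p_value = pearsonr(overlap, accuracy)
inverse = r < 0  # a strong negative r supports the overlap explanation
```

A rank-based coefficient such as Spearman's would serve equally well if the relation were monotone but not linear.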
Considering the impact of overlapping on the performance of the techniques, we present a study on tuning some parameters based on the level of overlapping in the datasets. Specifically, we study the significance threshold parameter that controls the addition mechanism in SETRED and SNNRCE. In the case of SETRED, this parameter controls the hypothesis used to decide whether a specific example is added to the labeled set L. The smaller this value, the more restrictive the selection of examples considered good. In the case of SNNRCE, the significance threshold is related to the hypothesis used to determine whether an example is considered a “doubt example.” The greater this value, the more examples will be considered doubtful and, according to the SNNRCE method, relabeled.
Figures 6 and 7 show the behavior of the significance threshold parameter across different levels of overlap. The significance threshold selected is the value, among three candidates (0.05, 0.10 and 0.15), that maximizes the accuracy of SETRED. In general, the most appropriate threshold seems to be the most restrictive value, 0.05, which benefits datasets with low overlapping. For datasets with a medium or high degree of overlap, more flexible values of the significance threshold are preferred. This behavior is more noticeable at 30% of labeled data.
Figures 8 and 9 show a similar behavior for the SNNRCE method. For this method, the value 0.05 again turns out to be the most suitable option for the significance threshold, although for some datasets with more overlapping the values 0.10 and 0.15 are better options. In contrast to SETRED, this behavior is more noticeable at 10% of labeled data.
4.5 Can semi-supervised learning improve classification performance?
A recommended procedure [12] in the presence of labeled and unlabeled data is to start by learning a supervised classifier from the labeled data, called the “baseline classifier.” A comparison with this classifier allows us to identify situations where the addition of unlabeled data degrades the performance of the resulting classifier. In this section, we perform such an analysis by measuring the accuracy gain obtained from the addition of unlabeled data during the training phase. We estimate the accuracy gain by subtracting the accuracy obtained with supervised classification from the accuracy obtained with SSC. In both cases, the classifier performance is evaluated on the testing set, using the same fivefold cross-validation scheme. We select the 1NN classifier as the baseline method because it offers the most accurate results.
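The gain computation reduces to a per-dataset subtraction; a sketch with hypothetical fold-averaged accuracies (illustrative values only):

```python
import numpy as np

def accuracy_gain(ssc_acc, baseline_acc):
    """Gain of semi-supervised classification over the supervised
    baseline (1NN trained on the labeled data alone); negative values
    signal performance degradation caused by the unlabeled data."""
    return np.asarray(ssc_acc) - np.asarray(baseline_acc)

# hypothetical mean accuracies over the five test folds of three datasets
ssc      = [0.88, 0.70, 0.93]
baseline = [0.85, 0.74, 0.93]
gain = accuracy_gain(ssc, baseline)
degraded = [bool(g < 0) for g in gain]  # second dataset harmed by SSL
```

Evaluating both classifiers on the same cross-validation folds keeps the subtraction paired, so the gain reflects the unlabeled data rather than fold variability.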
We expect the best performing self-labeled methods selected in the previous sections to obtain the highest accuracy gains, so we focus this analysis on those methods. Figure 10 shows the accuracy gain obtained for each dataset using three semi-supervised methods. A negative gain means performance degradation of the classifier. We observe very diverse behavior of the gains across the different labeled ratios and methods. Some datasets do not benefit from SSL, for instance ECG [60] and Wafer. The size of these datasets already ensures that the hypothesis learned by the supervised baseline is perfectly capable of obtaining accurate classification results in the inductive phase. On other datasets, such as MedicalI, classification performance deteriorates with the addition of unlabeled data at the 10 and 20% labeled ratios. This happens because the initial labeled data are insufficient to train a model good enough for the unlabeled data to be truly beneficial. It is noticeable that MedicalI is a multi-class dataset (10 classes) with high overlapping. This adverse situation starts to reverse at the 30% labeled ratio, where Democratic obtains a positive accuracy gain.
Although there are unfavorable situations for SSL in some datasets, Fig. 11 shows the gains obtained with SETRED, in decreasing order. In general, at 20% labeled data a significant positive gain can be observed. We also see that whether there is a benefit, and how large it is, depends heavily on the ratio of labeled instances. In summary, the circumstances that make SSL a suitable approach for a particular dataset depend on the modeling assumptions adopted for the classifier, as well as on the characteristics of the time series data.
In addition to the previous analysis, we consider the baseline classifier trained on the fully labeled training set and evaluated on the testing set. These accuracy results can be considered an upper bound for the self-labeled methods. Figure 12 shows the semi-supervised classification accuracy bounded by the baseline classifier. In general, the semi-supervised results on most of the datasets are competitive with the upper bound considered. Interestingly, for datasets such as StarLightC and Synthetic, the multi-learning hypothesis learned by Democratic outperforms the upper-bound classifier.
4.6 General discussion
This section gives a general discussion of the properties observed throughout this study. In addition, we highlight the methods that perform best in general.
- For most methods, accuracy increases with the number of labeled examples, although the increase is usually rather moderate. In self-labeled methods with SVMs as base classifiers, the increase is larger than in the other methods.
- The classical SelfT is clearly outperformed by the other self-labeled techniques, independently of the learning approach and the dissimilarity measure used.
- Usually, there is no difference between the rankings obtained with the accuracy and kappa statistics. This means that there is no significant difference in the way the classifiers benefit from random hits.
- In general, 1NN offers the best transductive and inductive results as a base classifier in most self-labeled methods. In addition, this base classifier yields the most competitive run-times compared with SVM and DT.
- Although SVM and DT do not offer competitive results as base classifiers, good results are obtained when these learning schemes are combined with 1NN in a multi-learning scheme (Democratic). However, the increased run-time is a side effect of the Democratic method.
- The use of the DTW and ERP distances results in an accuracy gain compared with the other measures studied. In the case of DTW, this advantage is reduced when SVM is the base classifier, a behavior caused by the indefiniteness of the kernels constructed under DTW.
- SETRED, Democratic and TriT-1NN are the best performing methods in this study. TriT-SVM also exhibits competitive behavior under the FFT and ERP dissimilarity measures. For Euclidean, ERP and FFT, we recommend the use of Democratic. For DTW, we recommend TriT-1NN and SETRED for transductive and inductive scenarios, respectively. For ACF, we recommend TriT-1NN for both inductive and transductive learning.
- The overlapping in datasets is an aspect that should be taken into account when solving time series classification problems. We find strong evidence of the negative effect of overlap on the accuracy of the self-labeled techniques.
- From the study of the significance threshold parameter, we recommend in general the most restrictive value, 0.05, for both SETRED and SNNRCE. For datasets with high levels of overlap, however, other values of this parameter should be considered.
5 Conclusion
We have investigated the applicability of different self-labeled methods for semi-supervised learning in a time series context. In addition to the popular SelfT with 1NN as base learner, we have explored other combinations of self-labeled methods and learning schemes that, to the best of our knowledge, had not previously been applied in a time series context. We can conclude with the following remarks:
- In general, 1NN is a robust choice for the base classifier in the semi-supervised context, as it offers the most accurate results and requires no parameter tuning.
- SelfT is always outperformed by other self-labeled methods such as TriT and SETRED.
- Our empirical study allows us to highlight three methods, namely SETRED, TriT-1NN and Democratic, that perform significantly better than the rest in terms of transductive and inductive capabilities.
- The use of ensembles of classifiers (TriT-1NN and Democratic) is a very promising way to perform SSL in the time series context. This is in line with recent studies [3, 39] in supervised classification of temporal data.
- Given the risk to classification performance posed by the addition of unlabeled data, we recommend comparing the SSL results with the 1NN baseline classifier to identify the real benefits of learning with unlabeled data. The overlapping in datasets is another aspect that should be taken into account when selecting the classification techniques, since the performance of the self-labeled techniques can be seriously affected by its presence.
References
Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66
Bagnall AJ, Janacek GJ (2004) Clustering time series from ARMA models with clipped data. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’04. ACM, New York, NY, pp 49–58
Bagnall A, Lines J, Hills J, Bostrom A (2015) Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Trans Knowl Data Eng 27(9):2522–2535
Balakrishnan S, Madigan D (2006) Decision trees for functional variables. In: Sixth international conference on data mining, ICDM ’06, pp 798–802
Batista G, Hao Y, Keogh E, MafraNeto A (2011) Towards automatic classification on flying insects using inexpensive sensors. In: IEEE 10th international conference on machine learning and applications (ICMLA). IEEE, vol 1, pp 364–369
Begum N, Hu B, Rakthanmanon T, Keogh E (2014) A minimum description length technique for semi-supervised time series classification. In: Integration of reusable systems, vol 263 of advances in intelligent systems and computing. Springer, Berlin, pp 171–192
Behera H, Dash P, Biswal B (2010) Power quality time series data mining using S-transform and fuzzy expert system. Appl Soft Comput 10(3):945–955
Ben-David A (2007) A lot of randomness is hiding in accuracy. Eng Appl Artif Intell 20(7):875–885
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Eleventh annual conference on computational learning theory, COLT’ 98. ACM, New York, NY, pp 92–100
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Carden EP, Brownjohn JM (2008) ARMA modelled time-series classification for structural health monitoring of civil infrastructure. Mech Syst Signal Process 22(2):295–314
Chapelle O, Schölkopf B, Zien A (2006) Semi-supervised learning, vol 2. MIT Press, Cambridge
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, SIGMOD ’05. ACM, New York, NY, pp 491–502
Chen Y, Hu B, Keogh E, Batista GE (2013) DTW-D: time series semi-supervised learning from a single example. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’13. ACM, New York, NY, pp 383–391
Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR time series classification archive. www.cs.ucr.edu/~eamonn/time_series_data/
Cuturi M (2011) Fast global alignment kernels. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 929–936
Dash P, Behera H, Lee I (2008) Time sequence data mining using time-frequency analysis and soft computing techniques. Appl Soft Comput 8(1):202–215
De Sousa CAR, Souza VMA, Batista GEAPA (2014) Time series transductive classification on imbalanced data sets: an experimental study. In: 22nd international conference on pattern recognition (ICPR), pp 3780–3785
De Sousa CAR, Souza VMA, Batista GEAPA (2015) An experimental analysis on time series transductive classification on graphs. In: International joint conference on neural networks (IJCNN), pp 1–8
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Douzal-Chouakria A, Amblard C (2012) Classification trees for time series. Pattern Recogn 45(3):1076–1091
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the 1994 ACM SIGMOD international conference on management of data, SIGMOD ’94. ACM, New York, NY, pp 419–429
Flesca S, Manco G, Masciari E, Pontieri L, Pugliese A (2007) Exploiting structural similarity for effective web information extraction. Data Knowl Eng 60(1):222–234
Frank J, Mannor S, Pineau J, Precup D (2013) Time series analysis using geometric template matching. IEEE Trans Pattern Anal Mach Intell 35(3):740–754
Fu T (2011) A review on time series data mining. Eng Appl Artif Intell 24(1):164–181
Fulcher BD, Jones NS (2014) Highly comparative feature-based time-series classification. IEEE Trans Knowl Data Eng 26(12):3026–3037
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064
Geler Z, Kurbalija V, Radovanović M, Ivanović M (2015) Comparison of different weighting schemes for the kNN classifier on time-series data. Knowl Inf Syst 48:331–378
Goldman SA, Zhou Y (2000) Enhancing supervised learning with unlabeled data. In: Seventeenth international conference on machine learning (ICML), pp 327–334
González M, Bergmeir C, Triguero I, Rodríguez Y, Benítez JM (2016) On the stopping criteria for k-nearest neighbor in positive unlabeled time series classification problems. Inf Sci 328:42–59
Hochberg Y, Rom D (1995) Extensions of multiple testing procedures based on Simes’ test. J Stat Plan Inference 48(2):141–152
Hodges J, Lehmann EL et al (1962) Rank methods for combination of independent experiments in analysis of variance. Ann Math Stat 33(2):482–497
Jeong Y, Jayaraman R (2015) Support vector-based algorithms with weighted dynamic time warping kernel function for time series classification. Knowl-Based Syst 75:184–191
Kaya H, GunduzOguducu S (2015) A distance based time series classification framework. Inf Syst 51:27–42
Kim M (2013) Semi-supervised learning of hidden conditional random fields for time-series classification. Neurocomputing 119:339–349
Kurbalija V, Radovanović M, Geler Z, Ivanović M (2014) The influence of global constraints on similarity measures for time-series databases. Knowl-Based Syst 56:49–67
Lei H, Sun B (2007) A study on the dynamic time warping in kernel machines. In: Third international IEEE conference on signal-image technologies and internet-based system (SITIS), SITIS ’07, pp 839–845
Li M, Zhou Z (2005) Setred: self-training with editing. In: Advances in knowledge discovery and data mining, vol 3518 of Lecture notes in computer science. Springer, Berlin, pp 611–621
Lines J, Bagnall A (2014) Time series classification with ensembles of elastic distance measures. Data Min Knowl Discov 29(3):565–592
Liu Y, Yao K, Liu S, Raghavendra CS, Balogun O and Olabinjo L (2011) Semi-supervised failure prediction for oil production wells. In: IEEE 11th international conference on data mining workshops (ICDMW). IEEE, pp 434–441
Marteau PF (2009) Time warp edit distance with stiffness adjustment for time series matching. IEEE Trans Pattern Anal Mach Intell 31(2):306–318
Marteau P, Gibet S (2015) On recursive edit distance kernels with application to time series classification. IEEE Trans Neural Netw Learn Syst 26(6):1121–1133
Marussy K, Buza K (2013) Success: a new approach for semi-supervised classification of time-series. In: Rutkowski L, Korytkowski M, Scherer R, Tadeusiewicz R, Zadeh L, Zurada J (eds) Artificial intelligence and soft computing, vol 7894. Lecture notes in computer science. Springer, Berlin, pp 437–447
Meng J, Wu L, Wang X, Lin T (2011) Granulation-based symbolic representation of time series and semi-supervised classification. Comput Math Appl 62(9):3581–3590
Petitjean F, Forestier G, Webb GI, Nicholson AE, Chen Y, Keogh E (2016) Faster and more accurate classification of time series by exploiting a novel dynamic time warping averaging algorithm. Knowl Inf Syst 47(1):1–26
Povinelli R, Johnson M, Lindgren A, Ye J (2004) Time series classification using gaussian mixture models of reconstructed phase spaces. IEEE Trans Knowl Data Eng 16(6):779–783
Pree H, Herwig B, Gruber T, Sick B, David K, Lukowicz P (2014) On general purpose time series similarity measures and their use as kernel functions in support vector machines. Inf Sci 281:478–495
R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286
Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12. ACM, New York, NY, pp 262–270
Ratanamahatana CA, Wanichsan D (2008) Stopping criterion selection for efficient semi-supervised time series classification. In: Lee R (ed) Software engineering, artificial intelligence, networking and parallel/distributed computing, vol 149. Studies in computational intelligence. Springer, Berlin, pp 1–14
Rodríguez JJ, Alonso CJ (2004) Interval and dynamic time warping-based decision trees. In: Proceedings of the 2004 ACM symposium on applied computing, SAC ’04. ACM, pp 548–552
Rodríguez JJ, Alonso CJ, Boström H (2000) Learning first order logic time series classifiers: rules and boosting. In: Principles of data mining and knowledge discovery, vol 1910 of Lecture notes in computer science. Springer, Berlin, pp 299–308
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process 26(1):43–49
Serrà J, Arcos JL (2014) An empirical evaluation of similarity measures for time series classification. Knowl-Based Syst 67:305–314
Shimodaira H, Noma K, Nakai M, Sagayama S (2001) Dynamic time-alignment kernel in support vector machine. In: Proceedings of the 14th international conference on neural information processing systems: natural and synthetic, NIPS’01. MIT Press, Cambridge, MA, pp 921–928
Triguero I, García S, Herrera F (2015) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl Inf Syst 42(2):245–284
Wang X, Mueen A, Ding H, Trajcevski G, Scheuermann P, Keogh E (2013) Experimental comparison of representation methods and distance measures for time series data. Data Min Knowl Discov 26(2):275–309
Wang Y, Xu X, Zhao H, Hua Z (2010) Semi-supervised learning based on nearest neighbor rule and cut edges. Knowl-Based Syst 23(6):547–554
Wei L (2006) Datasets used for experimental evaluation in the paper: semi-supervised time series classification. www.cs.ucr.edu/~wli/selfTraining/
Wei L, Keogh E (2006) Semi-supervised time series classification. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 748–753
Weng X, Shen J (2008) Classification of multivariate time series using two-dimensional singular value decomposition. Knowl-Based Syst 21(7):535–539
Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, Boston
Xi X, Keogh E, Shelton C, Wei L, Ratanamahatana CA (2006) Fast time series classification using numerosity reduction. In: Proceedings of the 23rd international conference on machine learning, ICML ’06. ACM, New York, pp 1033–1040
Xing Z, Pei J, Yu PS (2012) Early classification on time series. Knowl Inf Syst 31(1):105–127
Yamada Y, Suzuki E, Yokoi H, Takabayashi K (2003) Decision-tree induction from time-series data based on a standard-example split test. In: Twentieth international conference on machine learning, vol 3 of ICML ’03, pp 840–847
Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 189–196
Zhang D, Zuo W, Zhang D, Zhang H (2010) Time series classification using support vector machine with Gaussian elastic metric kernel. In: 20th international conference on pattern recognition (ICPR), ICPR ’10, pp 29–32
Zhou Y, Goldman S (2004) Democratic co-learning. In: IEEE 16th international conference on tools with artificial intelligence (ICTAI). IEEE, pp 594–602
Zhou Z, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans Knowl Data Eng 17(11):1529–1541
Zighed DA, Lallich S, Muhlenbach F (2002) Separability index in supervised learning. In: Principles of data mining and knowledge discovery: 6th European conference, PKDD 2002, Helsinki, Finland, August 19–23, 2002, Proceedings. Springer, Berlin, pp 475–487
Acknowledgements
We thank the anonymous reviewers for their very useful comments and suggestions. This work was supported in part by “Proyecto de Investigación de Excelencia de la Junta de Andalucía, P12-TIC-2958,” “Proyecto de Investigación del Ministerio de Economía y Competitividad, TIN2013-47210-P” and TIN-2016-81113-R. This work was partly performed while M. González held a travel grant from the Asociación Iberoamericana de Postgrado (AUIP), supported by Junta de Andalucía, to undertake a research stay at the University of Granada.
González, M., Bergmeir, C., Triguero, I. et al. Self-labeling techniques for semi-supervised time series classification: an empirical study. Knowl Inf Syst 55, 493–528 (2018). https://doi.org/10.1007/s10115-017-1090-9