1 Introduction

In the time series field, the semi-supervised learning (SSL [12]) paradigm has received considerable research attention during the past decade. As cheap sensors of all kinds become increasingly available, vast amounts of unlabeled time series data are generated. By contrast, the cost of the labeling process often makes it infeasible to obtain a fully labeled training set. In this situation, SSL is a good way to improve learning accuracy by exploiting the unlabeled data in conjunction with a small set of labeled data. Specifically, semi-supervised classification (SSC) focuses on training a classifier such that it outperforms a supervised classifier trained on the labeled data alone. In the time series domain, SSL has found a wide range of applications, such as the classification of flying insects [5], web information extraction [23] and failure prediction in oil production [40]. In this work, we tackle SSC problems in the time series classification context.

Time series data are characterized by high dimensionality and by their numerical and continuous nature. Therefore, special treatment is required for time series classification [25]. A first category of proposals, called the feature-based approach [7, 11, 17, 26, 62], transforms the original time series into a new description space where conventional classifiers can be applied. Signal processing or statistical tools are commonly used to extract features from the original raw data. A second category [21, 34, 46, 52, 53, 64] focuses on customizing or developing classifiers specifically designed for time series data. This category, which includes the instance-based approach, is mostly based on the selection of an appropriate representation of the time series and a suitable measure of dissimilarity. Several representations and dissimilarity measures have been proposed for the time series classification problem, including spectral approaches [22], the autocorrelation function [2] and elastic measures [13, 41, 54]. Our paper considers this second approach.

Several SSC approaches have been applied to the time series classification problem. The work of Marussy and Buza [43] follows the cluster-then-label approach using a constrained hierarchical single-link clustering method. The work of Frank et al. [24] applies a similar approach with a similarity measure called geometric template matching. The applicability of graph-based methods to time series classification is addressed in various works [18, 19]. The classical semi-supervised support vector machines method is extended to tackle time series classification by Kim [35].

Another family of SSC methods, denoted self-labeled techniques [57], aims to enlarge the original labeled set using the most confident predictions to classify unlabeled data. In contrast to the previously mentioned approaches, self-labeled techniques do not make any special assumptions about the distribution of the input data. Self-training [67] and co-training [9] are the most popular self-labeled techniques. Both approaches have been applied in a time series context.

Self-training is a wrapper method that iteratively classifies the unlabeled examples, assuming that its most confident predictions tend to be correct. In the time series domain, the self-training technique is mostly applied to a particular case of SSC, called positive unlabeled learning, in which only examples from one class are available [6, 14, 30, 51, 61]. In conjunction with self-training, the k-nearest-neighbor classifier (kNN [1]) is typically used as the base learner, as it has been shown to be particularly effective for time series classification tasks [55, 58].

Co-training is an SSC method that requires two conditionally independent views which are sufficient for learning and classification. For each view, the unlabeled instances with the highest confidence are selected and labeled to become additional training data for the other view. The multi-view requirement of the co-training technique is typically too strong and difficult to meet in the time series domain, where, in most cases, observations that are close together are correlated. The work of Meng et al. [44] applies a variant of co-training [29] which uses a hidden Markov model [49] and the one-nearest-neighbor (1NN) rule as two different learners instead of the classical two views of the data.

There are various works in the literature [6, 14, 44, 51, 61] that focus on the SSC of time series and involve self-labeled techniques. To the best of our knowledge, self-training and co-training are the only self-labeled techniques that have been applied in a time series context so far. Our study broadens this approach to other self-labeled techniques, thereby gaining new insights and allowing for more detailed conclusions on this topic. Moreover, despite the successful application of 1NN to time series classification tasks, the use of different learning approaches as base classifiers seems to be an under-explored area. These reasons motivate our paper, which has the following three main objectives:

  • To explore the applicability of classical self-labeled techniques in the time series domain, as well as the use of other classification schemes as base learners in addition to the well-known 1NN.

  • To identify the best methods for each base learner under different ratios of labeled data and dissimilarity measures.

  • To determine the influence of the geometrical characteristics of time series datasets on the performance of the self-labeled techniques.

The rest of the paper is organized as follows. In Sect. 2, we provide definitions and the notation of SSC in the time series domain. Furthermore, we discuss the main characteristics of the self-labeled techniques and the base learners involved in this study. In Sect. 3, we introduce the experimental framework. In Sect. 4, we present the results obtained and discuss them. Finally, Sect. 5 concludes the paper.

2 Semi-supervised time series classification

In this section, we define the problem of SSC of time series and introduce the principal notation and definitions. Furthermore, we review the self-labeled techniques and the base learner methods involved in this study (Sects. 2.1 and 2.2).

In SSC, the source dataset has two parts, L and U. Let L be the set of instances \(\{{x}_{1},{\dots },{x}_{l}\}\) for which the labels \(\{{y}_{1},{\dots },{y}_{l}\}\) are provided. Let U be the set of instances \(\{{x}_{l+1},{\dots },{x}_{l+u}\}\) for which the labels are not known. We follow the typical assumption that there is much more unlabeled than labeled data, i.e., \(u \gg l\). The whole set \(L \cup U\) forms the training set.

Throughout this study, each instance \({x}_{i}, i=1,\dots ,l+u,\) represents a univariate, real-valued and evenly spaced time series. In this case, the time series \(x_i=[p_1,p_2,\dots ,p_n]\) is considered a sequence of 1-dimensional data points.

Depending on the goal, we can categorize SSC into two slightly different settings [12], namely transductive and inductive learning. The former is devoted to predicting the labels of the unlabeled instances U provided during the training phase. The latter aims to predict the labels of unseen testing instances. In this work, we delve into both settings, aiming to perform an extensive analysis of the selected methods.

2.1 Self-labeled techniques

Self-labeled techniques follow a wrapper methodology using one or more supervised classifiers to determine the most likely class of the unlabeled instances during an iterative process. The base classifier(s) play(s) a crucial role in the estimation of the most confident instances of U. The main feature that distinguishes self-labeled methods is the way they obtain one or several enlarged labeled sets to efficiently represent the learned hypothesis from the training set. In the literature, there are several proposals that follow this approach which differ mainly in the following aspects:

  • Addition mechanism There is a variety of schemes in which the enlarged set can be formed. The most used ones are incremental and amending. The former adds, step-by-step, the most confident instances from U to the enlarged set. The latter allows rectifications to already labeled instances to avoid the introduction of noisy instances in the enlarged set(s).

  • Classification model Self-labeled techniques can use one or more base classifiers to establish the class of unlabeled instances. Single-classifier models assign to each instance the most likely class according to the classifier used. Multi-classifier models combine the hypotheses learned by several classifiers to estimate the class, either by agreement among the classifiers or by combining the probabilities obtained by the single classifiers.

  • Learning approach Independently of the number of base classifiers, the number of learning methods is another important issue. The single learning approach can be linked with single- and multi-classifier models. By contrast, the multi-learning approach is intrinsically related to the multi-classifier model. In a multi-learning method, the different classifiers come from different learning methods.

  • Stopping criterion This is the mechanism used to stop the self-labeling process, preventing the addition to L of labeled instances with a low confidence level. Often, a prefixed number of iterations is used as the stopping criterion. Another common criterion is the absence of changes in the learned hypotheses during successive iterations of the self-labeling process.

Since each approach has its own benefits and drawbacks, we include in this study a representative sample of methods. The selection made is based on the results obtained in the extensive overview study of Triguero et al. [57], and it includes the following methods:

  • Standard self-training (SelfT [67]): a single-classifier, single-learning method that extends the set L with the most confidently classified examples extracted from U during an iterative and incremental process; a minimal sketch of this loop is given after this list. The stopping criterion consists of a fixed number of iterations that can be adapted to the original size of U. Following a wrapper methodology, the base classifier used by self-training is treated as another parameter of the method.

  • Self-training with editing (SETRED [38]): a self-training variant with a different addition mechanism. SETRED introduces a data editing method to filter out noisy examples that have been labeled by the base classifier. In each iteration, potentially mislabeled examples are identified using the local information provided by a neighborhood graph [71].

  • Self-training nearest-neighbor rule using cut edges (SNNRCE [59]): a variant of SETRED that includes a first stage in which the most confident examples are added to L. In the next stage, standard self-training is applied with the 1NN rule as the base classifier. The iterative process stops when the expected number of examples of the minority class is reached, according to the distribution observed in L. In a final stage, suspected mislabeled examples are relabeled based on the information provided by the neighborhood graph.

  • Tri-training (TriT [70]): a variant of co-training that trains three classifiers instead of the traditional two. The three classifiers share the same learning scheme, and the diversity between them is obtained by manipulating the original set L, for example through Bagging [10]. In each iteration, the examples selected from U are labeled and added to the training set of a specific classifier only if the two remaining classifiers agree on the label and some additional conditions are satisfied. The stopping criterion is reached when the hypotheses of the three classifiers undergo no modification during a complete iteration.

  • Democratic co-learning (Democratic [69]): a multi-classifier and multi-learning method. The number of classifiers and their learning schemes are given as arguments of the method. Initially, all classifiers are trained using the examples in L. In each iteration, a label for each unlabeled example is proposed via majority vote. If, for a particular example, the classification provided by a classifier disagrees with the majority, then this example is added to the training set of that classifier. The iterative process stops when the training sets of the classifiers receive no additions during a complete iteration.
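
To make the wrapper loop concrete, the following is a minimal Python sketch of the incremental SelfT scheme with a 1NN base classifier operating on a precomputed dissimilarity matrix. The function name, the per-iteration batch size and the fixed iteration limit are illustrative assumptions, not the original implementation; SETRED would follow the same loop but filter the candidates through a neighborhood graph before adding them to L.

```python
import numpy as np

def self_training_1nn(D, y, labeled_idx, unlabeled_idx,
                      inst_per_iter=5, max_iter=50):
    """Incremental self-training sketch with a 1NN base classifier.

    D             : precomputed pairwise dissimilarity matrix (n x n)
    y             : label array; only the positions in labeled_idx are used
    labeled_idx   : indices of the initial labeled set L
    unlabeled_idx : indices of the unlabeled set U
    inst_per_iter : number of instances moved from U to L per iteration
    max_iter      : fixed number of iterations used as stopping criterion
    """
    L, U = list(labeled_idx), list(unlabeled_idx)
    y_hat = np.array(y, dtype=object)  # working copy of the labels

    for _ in range(max_iter):
        if not U:
            break
        # 1NN prediction for each unlabeled instance; confidence is the
        # distance to its closest labeled neighbor (smaller = more confident).
        preds = []
        for u in U:
            d_min, label = min(((D[u, l], y_hat[l]) for l in L),
                               key=lambda t: t[0])
            preds.append((d_min, label, u))
        preds.sort(key=lambda t: t[0])
        # Enlarge L with the inst_per_iter most confident predictions.
        for _, label, u in preds[:inst_per_iter]:
            y_hat[u] = label
            L.append(u)
            U.remove(u)
    return y_hat, L, U
```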

Table 1 summarizes the principal characteristics of the self-labeled methods selected.

Table 1 Summary of the self-labeled methods selected

The variety of stopping criteria associated with the self-labeled methods makes it difficult to estimate the maximum number of iterations performed by each method. For simplicity, we focus the temporal analysis on the complexity of executing one iteration of the main loop of each method. The algorithmic complexity is expressed in terms of the current number of unlabeled examples (u) and labeled examples (l) at the beginning of the iteration. Table 1 includes the time analysis of each method. The functions \(T_\mathrm{t}\) and \(T_\mathrm{c}\) represent the time cost associated with a specific learning scheme for training the model and classifying new instances, respectively. \(l'\) is the number of candidate examples selected to enlarge the set L, and \(l''\) is the resulting number of examples after the filtering process. In the case of the SETRED method, the construction of the neighborhood graph has cubic complexity. For the analysis of the Democratic method, we take into account the existence of N different learning schemes.

2.2 Supervised approaches for time series classification

Different approaches have been used to tackle the time series classification problem, such as kNN classifiers, decision trees (DT) and support vector machines (SVMs).

The kNN classifier has been widely applied in the time series context [28, 45]. This classifier approximates its confidence in terms of the dissimilarity between instances. Several distance measures have been proposed in the literature for evaluating the dissimilarity between time series: lock-step measures (e.g., Euclidean), feature-based measures (e.g., Fourier coefficients), model-based measures (e.g., autocorrelation functions) and elastic measures; illustrative implementations of the first three families are sketched below.
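
The sketch below gives simple Python versions of a lock-step, a feature-based and a model-based dissimilarity. The number of Fourier coefficients and autocorrelation lags are illustrative parameters and do not necessarily match the FFT and ACF measures used later in the experiments.

```python
import numpy as np

def euclidean_dist(x, y):
    """Lock-step measure: point-wise Euclidean distance between raw series."""
    return np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))

def fft_dist(x, y, n_coeff=20):
    """Feature-based measure: distance between the first Fourier coefficients."""
    fx = np.fft.rfft(x)[:n_coeff]
    fy = np.fft.rfft(y)[:n_coeff]
    return np.linalg.norm(fx - fy)

def acf_dist(x, y, max_lag=20):
    """Model-based measure: distance between autocorrelation functions."""
    def acf(s):
        s = np.asarray(s, float) - np.mean(s)
        var = np.sum(s ** 2) or 1.0
        return np.array([np.sum(s[:-k] * s[k:]) / var
                         for k in range(1, max_lag + 1)])
    return np.linalg.norm(acf(x) - acf(y))
```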

The construction of DT is another approach applied to time series classification. Yamada et al. [66] propose two binary split tests, called the standard-example and the cluster-example splits. The former selects an existing instance as the standard time series, and the members of the child nodes are selected depending on their distances to this instance. The latter split searches for two standard time series to bisect the set of instances. A similar idea is followed by Balakrishnan and Madigan [4] with a clustering-goodness criterion that searches for the pair of time series that best bisects the set of instances. In both works, the dynamic time warping (DTW [54]) distance is used. A new split criterion based on an adaptive metric that covers both behavior and value proximities is developed in Douzal-Chouakria and Amblard [21].

SVMs are another popular technique that has been applied to time series classification. The work of Pree et al. [47] compares several similarity measures used as kernel functions in SVMs. In contrast to other learning approaches, the SVM constructed with the Euclidean distance significantly outperforms the one obtained using the DTW distance. The reason for this behavior, analyzed in several works [37, 68], is the indefiniteness of the kernel constructed with DTW. The use of classical recursive elastic distances to construct recursive edit distance kernels is addressed in Marteau and Gibet [42]. The kernels constructed in this way are positive definite if certain sufficient conditions are satisfied. Moreover, the construction of a weighted DTW kernel to classify time series data is proposed in Jeong and Jayaraman [33].

In this study, we use as base learners three instance-based classifiers representative of these classification approaches, namely kNN, DT and SVM.

3 Experimental framework

This section presents the information related to the datasets involved in the study in Sect. 3.1. The performance measures and the configuration parameters of the algorithms used are addressed in Sects. 3.2 and 3.3, respectively.

3.1 Datasets

The experimentation is based on 35 standard classification datasets taken from publicly available repositories [15, 60]. Table 2 summarizes the main properties of the selected datasets. The datasets involved in this study contain between 56 and 9236 instances, the time series length ranges from 24 to 1882, and the number of classes varies between 2 and 14. For each dataset, the time series are z-normalized, following the recommendation in Rakthanmanon et al. [50].

The datasets are randomly divided using a fivefold cross-validation procedure. Each training partition (4/5 of the total set of examples) is randomly divided into two sets, L and U, containing the labeled and unlabeled examples, respectively (i.e., for U the labels are withheld and not available to the algorithm). Following the approach of Triguero et al. [57] and Wang et al. [59], we do not attempt to keep the class proportions in L and U the same as in the whole training set. The class labels of the instances selected to form the set U are removed, and we make sure that every class is represented in L.

With the purpose of studying the influence of the amount of labeled data, we use different ratios when dividing the training set. In our experiments, three ratios are used: 10, 20 and 30%. For instance, assuming a training partition that contains 100 examples, a labeled ratio of 10% means that 10 examples are put into L with their labels, while the remaining 90 examples are put into U without their labels. Note that the corresponding test partition (25 examples) is kept aside to evaluate the inductive performance of the learned hypothesis. A sketch of this splitting procedure is given below.
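
As an illustration, the following sketch divides one training partition into L and U for a given labeled ratio while guaranteeing that every class appears in L. It is a simplified reconstruction for exposition (the seed handling and rounding are assumptions), not the authors' partitioning code.

```python
import numpy as np

def split_labeled_unlabeled(y_train, labeled_ratio, rng=None):
    """Split the indices of a training partition into L (labeled) and U (unlabeled).

    y_train       : class labels of the training partition
    labeled_ratio : fraction of instances kept in L (e.g., 0.1, 0.2, 0.3)
    Returns the index arrays (L_idx, U_idx); the labels of U_idx are withheld.
    """
    rng = np.random.default_rng(rng)
    y_train = np.asarray(y_train)
    n = len(y_train)
    classes = np.unique(y_train)
    n_labeled = max(int(round(labeled_ratio * n)), len(classes))

    # Guarantee that every class is represented in L by one instance ...
    L_idx = [rng.choice(np.flatnonzero(y_train == c)) for c in classes]
    # ... then fill the rest of L at random from the remaining instances.
    remaining = np.setdiff1d(np.arange(n), L_idx)
    extra = rng.choice(remaining, size=n_labeled - len(L_idx), replace=False)
    L_idx = np.concatenate([np.array(L_idx), extra])
    U_idx = np.setdiff1d(np.arange(n), L_idx)
    return L_idx, U_idx
```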

Table 2 Summary description of the times series datasets

3.2 Performance measures

With the aim of measuring the effectiveness of the classification performed by the SSC techniques, we use two classical statistics: accuracy rate [63] and Cohen’s kappa rate [8]. The two measures are briefly explained as follows:

  • Accuracy: This measure reflects the agreement between the observed and predicted classes. It is a simple metric commonly employed for assessing the performance of classifiers.

  • Cohen’s kappa: This measure takes into account the successful hits that would be obtained simply by chance (see the formula below). Cohen’s kappa ranges from −1 to 1, where a value of 0 means there is no agreement beyond chance, a value of 1 indicates total agreement, and a negative value indicates that the prediction is in the opposite direction.
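
For completeness, with \(p_o\) denoting the observed agreement (the accuracy) and \(p_e\) the agreement expected by chance, computed from the marginal class frequencies of the confusion matrix, Cohen’s kappa is

\(\kappa = \dfrac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{c=1}^{C} \dfrac{n_{c\cdot }}{n} \cdot \dfrac{n_{\cdot c}}{n},\)

where \(n_{c\cdot }\) and \(n_{\cdot c}\) are the row and column totals of the confusion matrix for class c, and n is the total number of evaluated instances.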

3.3 Algorithms used and parameters

In this section, we specify the configuration parameters of all the methods involved in this study. The selected values are uniformly used in all the datasets, and they were mostly selected taking into account the recommendations offered in previous works. The parameters are not optimized for each specific dataset because the main purpose of this experimental study is to compare the general performance of the self-labeled techniques. The configuration parameters are shown in Table 3.

Table 3 Parameter specification for the base learners and self-labeled methods involved in the experimentation

Some of the self-labeled methods have their own built-in stopping criteria, which we use for those methods. For methods that do not have a predefined stopping criterion, we define one as follows. For each dataset, the self-labeling process stops when the first of the following criteria is satisfied: (i) 70% of the unlabeled instances in the initial set U have been removed and inserted into L, or (ii) the algorithm has reached a maximum of 50 iterations. Here, InstPerIter is the number of instances removed from U in each iteration. The proposed stopping criterion facilitates the exploitation of U while avoiding the extreme case of adding all the unlabeled instances from U to the base learner.

Most of the self-labeled methods include one or more base classifiers. For those methods that support base classifiers from different approaches (SelfT and TriT), we explore all the possible combinations. In this study, we select as base classifiers three methods that represent influential approaches to time series classification: kNN, DT and SVM.

1NN is a widely used classifier in the time series classification domain. Multiple studies [28, 36, 55, 65] related to time series similarity measures are based on this classifier.

The method proposed in Douzal-Chouakria and Amblard [21] is selected to construct DT specifically designed to classify time series data. This method obtains competitive results, and its split procedure is flexible enough to cover behavior and value proximities. The cost function \(c_b\) used to evaluate the proximity between two series is \(c_b(r)=2/(1+\exp (b \cdot \mathrm{Co}(r))) \cdot c(r)\), where r is a mapping between the two series, \(\mathrm{Co}\) is the behavior-based cost function, c is the value-based cost function, and the parameter b modulates the influence of the behavior in the overall cost. This parameter has been empirically fixed to 2 (Table 3). As the \(\mathrm{Co}\) function, we use a variant of the Pearson correlation involving first-order differences, and as the c function, we explore several time series measures.
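
The sketch below illustrates how such an adaptive cost could be evaluated for two aligned series of equal length: the behavior term \(\mathrm{Co}\) is approximated here by the (non-centered) correlation of the first-order differences and the value term c by the Euclidean distance, with the mapping r taken as the identity alignment. These are illustrative simplifications of the actual split criterion of [21].

```python
import numpy as np

def adaptive_cost(x, y, b=2.0):
    """Adaptive proximity c_b = 2 / (1 + exp(b * Co)) * c between two aligned series.

    Co : behavior-based component, here the correlation of the first-order
         differences of the two series (values in [-1, 1]).
    c  : value-based component, here the Euclidean distance.
    b  : modulates the influence of the behavior term (b = 2 in the experiments).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = np.diff(x), np.diff(y)
    # Behavior term: correlation of the first-order differences.
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    co = np.sum(dx * dy) / denom if denom > 0 else 0.0
    # Value term: Euclidean distance between the raw series.
    c = np.linalg.norm(x - y)
    return 2.0 / (1.0 + np.exp(b * co)) * c
```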

The kernel function selected to construct the SVMs is the Gaussian radial basis function (RBF), i.e., \(K_{d}(x_i,x_j)=\exp ({-d(x_i,x_j)^2/(2\sigma ^2)})\), where d is a distance measure selected from the time series domain, as in most previous studies [42, 47, 68]. Following the methodology in Marteau and Gibet [42], we normalize the pairwise distance matrix in the training stage to limit the search space of the parameters. Specifically, we use a predefined set of C and \(\sigma \) values and select the most appropriate combination through a cross-validation process. Additional kernels were also evaluated [16, 56], but they incur an unfeasible computational cost in the self-labeled context.
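
As an illustration of this distance-substitution construction, the sketch below turns a (normalized) pairwise distance matrix into an RBF-style Gram matrix and lists a candidate grid for C and \(\sigma \). The normalization by the maximum training distance and the grid values are assumptions for the example, not necessarily the settings used in the experiments; note that with an elastic distance such as DTW the resulting matrix is not guaranteed to be positive semi-definite.

```python
import numpy as np
from itertools import product

def normalize_distances(D_train, D_test=None):
    """Scale distances by the maximum training distance to bound the parameter search."""
    scale = D_train.max() or 1.0
    return D_train / scale, (None if D_test is None else D_test / scale)

def rbf_gram_from_distances(D, sigma):
    """Distance-substitution RBF kernel: K = exp(-D^2 / (2 * sigma^2))."""
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2))

# Candidate (C, sigma) pairs that would be explored by cross-validation
# on the labeled data (illustrative values only).
param_grid = list(product([0.1, 1.0, 10.0, 100.0],   # C
                          [0.25, 0.5, 1.0, 2.0]))    # sigma
```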

Table 4 Self-labeled methods ordered by the average accuracy results obtained in the transductive phase

Throughout the experimentation, we evaluate five different measures to compute the dissimilarity between instances: Euclidean, DTW, ERP, ACF and FFT. The Sakoe–Chiba band global constraint [54] is used to improve the performance of the elastic measures. Specifically, we fix the window size to 4% and 9% of the time series length for DTW and ERP, respectively, following the recommendation in Kurbalija et al. [36].
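
For reference, the sketch below gives a plain dynamic-programming implementation of DTW with a Sakoe–Chiba band, where the band half-width is a fraction of the series length (0.04 for DTW in our setting, 0.09 for ERP). The squared point cost and the final square root are conventional choices made for this illustration only.

```python
import numpy as np

def dtw_sakoe_chiba(x, y, window_frac=0.04):
    """DTW distance between two series under a Sakoe-Chiba band constraint."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    # Band half-width: a fraction of the series length, at least |n - m|.
    w = max(int(round(window_frac * max(n, m))), abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))
```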

4 Results and discussion

This section presents the results obtained in the experiments and a detailed discussion of them. We evaluate the performance of the methods in two different settings, namely transductive learning (Sect. 4.1) and inductive learning (Sect. 4.2), under three different ratios of labeled data. Section 4.3 presents an empirical analysis of the run-times obtained by the self-labeled techniques. Section 4.4 addresses the geometrical characteristics of the time series datasets and their influence on the performance of the techniques studied. A comparison between the supervised and semi-supervised learning paradigms is presented in Sect. 4.5. Finally, a discussion of all results is given in Sect. 4.6.

We use nonparametric statistical tests to contrast the results obtained, following the methodology proposed in García et al. [27]. Concretely, we use the Aligned Friedman test [32] for multiple comparisons to detect statistically significant differences between the evaluated methods and the post-hoc procedure of Hochberg [31] to characterize those differences. In comparisons with only two algorithms involved, we use the Wilcoxon signed rank test, following the recommendation in Demšar [20].

4.1 Transductive results

As stated in Sect. 2.1, the main goal of transductive classification is to predict the class of the unlabeled data used during the training phase. Table 4 presents the average accuracy (Acc) results of the self-labeled methods involved in this study over the 35 datasets with 10, 20 and 30% of labeled data. The methods are presented in descending order of accuracy. For those methods that support different base classifiers, we have explored all possible combinations, specifying the classifier in the method’s name.

The difference between this ranking and the ranking obtained when sorting by the kappa results is denoted \(\Delta \mathbf K \): positive values indicate that the method obtains a better position in the ranking by kappa, negative values indicate the opposite, and zero means no difference. \(\Delta K\) shows whether or not a certain method benefits from random hits. Table 4 reveals no significant difference between the two orderings, as only a handful of methods exhibit \(\Delta K\) values different from zero.

Figure 1 shows box and whisker plots of the methods under the dissimilarity measures studied. This illustration allows us to visualize the performance of the self-labeled methods in more detail. It shows the gain in accuracy obtained by using DTW and ERP compared with the other measures. The superiority of DTW over the Euclidean distance has been addressed in previous studies on the time series classification problem. For instance, the extensive study performed by Serrà and Arcos [55] supports this conclusion. Furthermore, the study of Wang et al. [58] reveals that DTW and ERP are clearly superior to the Euclidean distance. In this sense, our results confirm the advantage of these elastic measures in the semi-supervised context.

Fig. 1
figure 1

Box and whisker plots for the accuracy in transductive phase. The bottom and top of a box are the first and third quartiles. The band inside the box is the median. a 10% labeled data. b 20% labeled data. c 30% labeled data

On the other hand, the methods that use 1NN as a base classifier exhibit the best performance, whereas the lowest results are obtained in combination with DT. Moreover, the use of SVM as a base classifier leads to a larger spread of the accuracy values. Figure 2 presents the average results in a bar plot, aiming to compare the accuracy values across the different labeled ratios.

Fig. 2
figure 2

Bar plot of the comparison between the average accuracy obtained during the transductive phase

We perform a comparison of the accuracy among all single learning methods grouped by their learning scheme. This comparison allows us to determine the most successful self-labeled methods for each base classifier.

The Aligned Friedman test, applied to the accuracy of all methods that use 1NN as a base classifier, detects significant differences at a significance level of \(\alpha = 0.05\) in all comparisons performed. Table 5 shows the rankings obtained. The most accurate method is chosen as the control method for the application of the post-hoc procedure. For the Euclidean and FFT distances, the method selected is SETRED for all labeled ratios. For DTW, ACF and ERP, the method selected is TriT in most of the comparisons. SNNRCE and SelfT show the lowest accuracy values. In the majority of comparisons, these methods are significantly outperformed by the control method according to the Hochberg post-hoc procedure.

Table 6 shows the application of the Wilcoxon signed ranks test to contrast the accuracy of the methods that use DT and SVM as base classifiers. For all dissimilarity measures and labeled ratios, TriT outperforms SelfT with both base classifiers. The differences are significant at a significance level of \(\alpha = 0.05\), with the exception of ERP at 10% of labeled data.

Finally, Table 7 offers a comparison between the most competitive methods from single learning and the multi-learning approach (Democratic). We consider as “competitive” those methods that have not been significantly outperformed more than twice in the 15 comparisons performed (five distance measures \(\times \) three labeled ratios) in Tables 5 and 6. Following this criterion, the outstanding methods selected are: SETRED and TriT.

The Aligned Friedman test, applied to accuracy, detects significant differences at a significance level of \(\alpha = 0.05\) in all comparisons performed. For the dissimilarity measures DTW and ACF, the control method selected is TriT-1NN in most cases. For the dissimilarity measures Euclidean, FFT and ERP, Democratic is selected as the control method in most of the comparisons. SETRED exhibits competitive behavior because it is not outperformed by the control method in any of the comparisons. By contrast, TriT-SVM and TriT-DT are significantly outperformed by the control method in most of the comparisons performed, with the exception of ERP, where TriT-SVM behaves better.

Table 5 Aligned Friedman ranking of the accuracy using 1NN as a base classifier
Table 6 Wilcoxon signed ranks test of the accuracy for DT and SVM as base classifiers
Table 7 Aligned Friedman ranking of the accuracy of the most competent methods for each learning approach using the dissimilarity measures studied

4.2 Inductive results

The main target of inductive learning is to classify instances not included in the training phase. This analysis is useful to test the previously learned hypotheses and their generalization abilities. Table 8 shows the average accuracy obtained using all dissimilarity measures studied. Figure 4 shows box and whisker plots of the same results grouped by labeled ratios. Figure 3 shows a bar plot of the average accuracy, reflecting the improvement obtained by increasing the amount of labeled examples.

Table 8 Self-labeled methods ordered by the average accuracy results obtained in the inductive phase
Fig. 3
figure 3

Bar plot of the comparison between the average accuracy obtained during the inductive phase

Fig. 4
figure 4

Box and whisker plots for the accuracy in inductive phase. a 10% labeled data. b 20% labeled data. c 30% labeled data

Once more, we perform a comparison of the accuracy among all single learning methods grouped by their learning scheme. The Aligned Friedman test, applied to the accuracy of all methods that use 1NN as the base learning scheme, detects significant differences in all comparisons performed. Table 9 shows the rankings obtained. The control method selected is SETRED in most of the comparisons, except for ACF, where TriT is selected for two labeled ratios. In general, SNNRCE and SelfT show the lowest accuracy values and are significantly outperformed by the control method in most of the comparisons according to the Hochberg post-hoc procedure.

Table 10 shows the application of the Wilcoxon signed ranks test to the accuracy of the methods that use DT and SVM as base classifiers. For all dissimilarity measures and labeled ratios, TriT significantly outperforms SelfT with both base classifiers, with the exception of SVM under ERP at 10% of labeled data.

Finally, Table 11 offers a comparison between the most competitive methods from single learning and the multi-learning approach. Once more, the outstanding methods selected are SETRED and TriT. The Aligned Friedman test, applied to accuracy, detects significant differences in all comparisons performed. For the dissimilarity measures Euclidean, FFT and ERP, the control method selected is Democratic in all comparisons. For ACF, TriT-1NN is selected as the control method in most of the comparisons. SETRED exhibits the best behavior under the dissimilarity measure DTW. TriT-SVM and TriT-DT are significantly outperformed by the control method in most of the comparisons, with the exception of FFT and ERP, where TriT-SVM exhibits competitive behavior.

4.3 Experimental run-times

From the temporal analysis performed in Sect. 2.1, it is clear that the main source of temporal complexity is the successive operations of training the model(s) and classifying instances. The cost associated with these operations directly depends on the learning scheme(s) used. In this section, we present an empirical analysis of the run-times based on a sample of 20 datasets included in the experimentation. All experiments were performed on a cluster composed of 46 nodes, each equipped with an Intel\(^{\circledR }\) Core\(^{TM }\) i7-930 processor at 2.8 GHz and 24 GB of RAM. Under GNU/Linux, we ran the experiments using R version 3.3.1 [48].

During the training and classification process, we provide the time series datasets to the self-labeled techniques in the form of precomputed distance matrices. This avoids repeated distance calculations and therefore reduces the running time. For that reason, the time spent computing the distance matrices is not considered in the current analysis.
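
The sketch below illustrates this design choice: all pairwise dissimilarities are computed once before self-labeling, and each 1NN query issued during the iterations is then resolved by indexing into the cached matrix. The helper names are illustrative, not those of our implementation.

```python
import numpy as np

def precompute_distance_matrix(series, dist):
    """Compute the full pairwise dissimilarity matrix once, before self-labeling."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(series[i], series[j])
    return D

def nn_predict(D, query_idx, labeled_idx, labels):
    """1NN prediction for one instance by looking up the cached distances."""
    labeled_idx = np.asarray(labeled_idx)
    nearest = labeled_idx[np.argmin(D[query_idx, labeled_idx])]
    return labels[nearest]
```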

Table 12 contains the run-times of the self-labeled techniques using a sequential execution of the fivefold cross-validation procedure. TriT-1NN obtains the lowest run-times in all cases. The 1NN base classifier produces the shortest run-times, whereas SVM produces the longest ones; this is caused by the time involved in constructing the SVM with a threefold cross-validation process to adjust the parameters C and \(\sigma \). Democratic incurs expensive run-times because it trains a classifier for each learning scheme. SETRED and SNNRCE obtain competitive results since the base classifier trained is 1NN.

4.4 Impact of the geometrical characteristics of datasets

The geometrical characteristics of the datasets affect the performance of the self-labeled methods. The overlapping of samples is a common source of difficulty in the classification process. In addition, the reduced labeled sample in the SSC framework and the high dimensionality of time series data introduce another layer of complexity. In Wang et al. [59], the overlapping of samples from different classes is investigated in order to explain the performance degradation experienced by SSC algorithms on datasets that suffer from this problem.

We follow a similar idea by computing the neighbors of each training instance (from U) in a neighborhood graph constructed from \(L \cup U\). The proportion of its neighbors from L whose class differs from that of the analyzed training instance is used as an overlapping measure. Table 13 includes this overlapping measure averaged over all training examples of each dataset. The neighborhood graph was computed for each dissimilarity measure and proportion of labeled data. Datasets with a high proportion of neighbors of a different class correspond to high levels of overlap between classes.
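
A possible reconstruction of this measure is sketched below: for each unlabeled training instance (whose true class is known for this analysis), its nearest neighbors in \(L \cup U\) are retrieved from the dissimilarity matrix, and the fraction of its labeled neighbors whose class differs from that of the instance is averaged over the dataset. The k-nearest-neighbor graph is used here as a stand-in for the neighborhood graph actually employed, and k is an illustrative parameter.

```python
import numpy as np

def overlap_measure(D, y_true, labeled_idx, unlabeled_idx, k=5):
    """Average proportion of labeled neighbors having a different class.

    D             : pairwise dissimilarity matrix over L u U
    y_true        : ground-truth classes (used only for this dataset analysis)
    labeled_idx   : indices of L
    unlabeled_idx : indices of U (the training instances analyzed)
    k             : neighborhood size standing in for the neighborhood graph
    """
    labeled_set = {int(i) for i in labeled_idx}
    proportions = []
    for u in unlabeled_idx:
        order = np.argsort(D[u])                  # instances sorted by proximity to u
        neigh = [i for i in order if i != u][:k]  # its k nearest neighbors in L u U
        lab_neigh = [i for i in neigh if int(i) in labeled_set]
        if lab_neigh:
            proportions.append(np.mean([y_true[i] != y_true[u] for i in lab_neigh]))
    return float(np.mean(proportions)) if proportions else 0.0
```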

Table 9 Aligned Friedman ranking of the accuracy using 1NN as a base classifier
Table 10 Wilcoxon signed ranks test of the accuracy for DT and SVM as base classifiers

In order to investigate the impact of overlapping on classification performance, we correlate the values of Table 13 with the accuracy obtained by the self-labeled techniques. Figure 5 shows the correlations obtained between the two variables studied (accuracy and overlapping). For all dissimilarity measures and proportions of labeled data studied, the two variables show a strong inverse correlation. This means that the presence of overlapping in a dataset significantly affects the performance of the self-labeled techniques.

Considering the impact of overlapping on the performance of the techniques, we present a study on tuning some parameters based on the level of overlapping in the datasets. Specifically, we study the significance threshold parameter that controls the addition mechanism in the SETRED and SNNRCE methods. In the case of SETRED, this parameter controls the hypothesis test used to decide whether a specific example should be added to the labeled set L: the smaller this value, the more restrictive the selection of examples considered good. In the case of SNNRCE, the significance threshold is related to the hypothesis test used to determine whether an example is considered a “doubt example”: the greater this value, the more examples will be considered doubtful and, according to the SNNRCE method, relabeled.

Table 11 Aligned Friedman ranking of the accuracy of the competent methods for each learning approach using the dissimilarity measures studied
Table 12 Run-time means (s) associated with the training and testing process of the self-labeled techniques for each labeled percent evaluated
Table 13 The proportions of neighbors with different class computed for each training instance, using the information provided by a neighborhood graph
Fig. 5
figure 5

Spearman’s \(\rho \) correlation coefficients obtained between the overlapping estimated in the datasets and the accuracy results of the self-labeled techniques. a 10% labeled data. b 20% labeled data. c 30% labeled data

Fig. 6
figure 6

Best configuration of the significance threshold parameter for each dataset in the SETRED method. a Euclidean 10%, b Euclidean 20%, c Euclidean 30%, d DTW 10%, e DTW 20%, f DTW 30%, g ACF 10%, h ACF 20% and i ACF 30%

Fig. 7
figure 7

Best configuration of the significance threshold parameter for each dataset in the SETRED method. a FFT 10%, b FFT 20%, c FFT 30%, d ERP 10%, e ERP 20% and f ERP 30%

Fig. 8
figure 8

Best configuration of the significance threshold parameter for each dataset in the SNNRCE method. a Euclidean 10%, b Euclidean 20%, c Euclidean 30%, d DTW 10%, e DTW 20%, f DTW 30%, g ACF 10%, h ACF 20% and i ACF 30%

Fig. 9
figure 9

Best configuration of the significance threshold parameter for each dataset in the SNNRCE method. a FFT 10%, b FFT 20%, c FFT 30%, d ERP 10%, e ERP 20% and f ERP 30%

Fig. 10
figure 10

Bar plot of the accuracy gain for the best performing methods in inductive phase, using Euclidean distance. a 10% labeled data. b 20% labeled data. c 30% labeled data

Fig. 11
figure 11

Bar plot of the accuracy gain for SETRED in inductive phase, using Euclidean distance. a 10% labeled data. b 20% labeled data. c 30% labeled data

Fig. 12
figure 12

How far removed is semi-supervised from supervised results, using inductive learning under Euclidean distance. The upper gray lines represent the values of the baseline classifier trained on the fully labeled training set. a SETRED. b TriT-1NN. c Democratic

Figures 6 and 7 show the behavior of the significance threshold parameter across different levels of overlap. The significance threshold selected is the value, among three possible values (0.05, 0.10 and 0.15), that maximizes the accuracy of SETRED. In general, the most appropriate threshold seems to be the most restrictive value, 0.05, which benefits datasets with low overlapping. For other datasets, including those with a medium or high degree of overlap, more flexible values of the significance threshold are preferred. This behavior is more noticeable at 30% of labeled data.

Figures 8 and 9 show a similar behavior for the SNNRCE method. For this method, the most suitable value for the significance threshold is also 0.05, although for some datasets with more overlapping the values 0.10 and 0.15 turn out to be a better option. In contrast to SETRED, this behavior is more noticeable at 10% of labeled data.

4.5 Can semi-supervised learning improve classification performance?

A recommended procedure [12] in the presence of labeled and unlabeled data is to start by learning a supervised classifier from the labeled data alone, called the “baseline classifier.” A comparison with this classifier allows us to identify situations in which the addition of unlabeled data degrades the performance of the resulting classifier. In this section, we perform such an analysis by measuring the accuracy gain obtained by adding unlabeled data during the training phase. We estimate the accuracy gain by subtracting the accuracy obtained with supervised classification from the accuracy obtained with SSC. In both cases, the classifier performance is evaluated on the testing set using the same fivefold cross-validation scheme. We select the 1NN classifier as the baseline method because it offers the most accurate results.

We expect the best performing self-labeled methods selected in the previous sections to obtain the highest accuracy gains, so we focus this analysis on those methods. Figure 10 shows the accuracy gain obtained for each dataset using three semi-supervised methods. A negative gain means performance degradation of the classifier. We can observe very diverse behavior of the gains across the different labeled ratios and methods. Some datasets do not benefit from SSL, for instance ECG [60] and Wafer. The size of these datasets already ensures that the hypothesis learned by the supervised baseline is perfectly capable of obtaining accurate classification results in the inductive phase. On other datasets, such as MedicalI, classification performance deteriorates with the addition of unlabeled data at the 10 and 20% labeled ratios. This happens because the initial labeled data provided are insufficient to train a model accurate enough for the unlabeled data to be truly beneficial. It is noticeable that MedicalI is a multi-class dataset (10 classes) with high overlapping. This adverse situation starts to reverse at the 30% labeled ratio, where Democratic obtains a positive accuracy gain.

Although there are unfavorable situations for SSL in some datasets, Fig. 11 shows the gains obtained with SETRED in decreasing order. In general, at 20% labeled data a significant positive gain can be observed. We also see that whether there is a benefit, and how large it is, depends heavily on the ratio of labeled instances. Summarizing, the circumstances that make SSL a suitable approach for a particular dataset depend on the modeling assumptions adopted for the classifier, as well as on the characteristics of the time series data.

In addition to the previous analysis, we consider the baseline classifier trained on the fully labeled training set and evaluated on the testing set. These accuracy results can be regarded as an upper bound for the self-labeled methods. Figure 12 shows the semi-supervised classification accuracy together with this upper bound. In general, the semi-supervised results on most of the datasets are competitive with the upper bound considered. Interestingly, for datasets such as StarLightC and Synthetic, the multi-learning hypothesis learned by Democratic outperforms the upper-bound classifier.

4.6 General discussion

This section gives a general discussion of the properties observed throughout this study. In addition, we highlight the methods that perform best in general.

  • For most of the methods, accuracy increases as the number of labeled examples grows, although this increase is usually rather moderate. In self-labeled methods with SVMs as base classifiers, the increase is larger than in the other methods.

  • The classical SelfT is clearly outperformed by other self-labeled techniques independently of the learning approach and the dissimilarity measure used.

  • Usually, there is no difference between the rankings obtained with the accuracy and kappa statistics. This means that there is no significant difference in the way the classifiers benefit from random hits.

  • In general, 1NN offers the best transductive and inductive results as base classifier in most self-labeled methods. In addition, this base classifier yields the most competitive run-times in comparison with SVM and DT.

  • Although SVM and DT do not offer competitive results as base classifiers, when these learning schemes are combined with 1NN, following a multi-learning scheme (Democratic), good results are obtained. However, an increased run-time is a side effect of the Democratic method.

  • The use of the DTW and ERP distances results in a gain in accuracy compared with the other measures studied. In the case of DTW, this advantage is reduced in the presence of SVM as a base classifier. This behavior is caused by the indefiniteness of the kernels constructed with DTW.

  • SETRED, Democratic and TriT-1NN are the best performing methods in this study. TriT-SVM also exhibits competitive behavior under the FFT and ERP dissimilarity measures. For Euclidean, ERP and FFT, we recommend the use of Democratic. For DTW, we recommend TriT-1NN and SETRED for the transductive and inductive scenarios, respectively. For ACF, we recommend TriT-1NN for both inductive and transductive learning.

  • Overlapping in datasets is an aspect that should be taken into account when solving time series classification problems. We find strong evidence of the negative effect of overlap on the accuracy of the self-labeled techniques.

  • From the study of the significance threshold parameter, we generally recommend the most restrictive value, 0.05, for both SETRED and SNNRCE. However, for datasets with high levels of overlap, other values of this parameter should be considered.

5 Conclusion

We have investigated the applicability of different self-labeled methods for semi-supervised learning in a time series context. In addition to the popular SelfT with 1NN as base learner, we have explored other combinations of self-labeled methods and learning schemes that had not been applied in a time series context, to the best of our knowledge. We can conclude with the following remarks:

  • In general, 1NN is a robust choice for the base classifier in the semi-supervised context as it offers the most accurate results and no parameters have to be tuned.

  • SelfT is always outperformed by other self-labeled methods such as TriT and SETRED.

  • Our empirical study allows us to highlight three methods, in particular SETRED, TriT-1NN and Democratic, that perform significantly better than the rest in terms of transductive and inductive capabilities.

  • The use of ensembles of classifiers (TriT-1NN and Democratic) is a very promising attempt to perform SSL in the time series context. This is in line with recent studies [3, 39] in supervised classification of temporal data.

  • Taking into account the underlying risk to classification performance caused by the addition of unlabeled data, we recommend comparing the SSL results with 1NN as a baseline classifier to identify the real benefits of learning with unlabeled data. Overlapping in datasets is another aspect that should be taken into account when selecting the classification techniques. Specifically, the performance of the self-labeled techniques can be seriously affected by the presence of overlapping.