1 Introduction

In recent decades, the surface electromyography (EMG) signal has been widely investigated for neuromuscular disorder diagnosis, rehabilitation, the control of prosthetic devices, and man–machine interfaces targeting individuals with amputations or congenitally deficient limbs [2, 8, 17, 24, 26, 37, 41]. This is because the EMG signal provides a highly useful characterization of the neuromuscular system: many pathological processes, whether arising in the nervous system or in the muscles, manifest themselves as alterations in the signal properties.

The analysis and processing of EMG signals, mainly for the purpose of pattern classification, is typically organized into two interdependent modules [15, 26]: (1) feature extraction and (2) classification. Feature extraction is especially helpful when the pattern to be represented is a sequence of values taken as a function of time, say x(t), such as the EMG signal. In general, there are four classes of feature extraction approaches for representing 1D signals, namely those based on time, frequency, time–frequency, and nonlinear dynamics.

It has been shown that biomedical signals, such as the EMG, are inherently nonlinear in nature, exhibiting well-defined properties such as scale invariance, scaling range, power law scaling, and self-similarity [14, 38]. Self-similarity, in particular, whereby the small-scale structure of an object resembles its large-scale structure, has been exploited both to characterize different biomedical signals and to identify different patterns within them [25, 32]. In fact, EMG signals usually show noticeable traces of self-similarity that can be captured by fractal dimension (FD) measures [22], providing a way to extract discriminative features directly from these signals [13]. Roughly speaking, the FD amounts to a non-integer, fractional dimension of a geometric object [4, 44].

In [33], among the nonlinear methods investigated for representing EMG signals, fractal dimension was found to be especially interesting for its sensitivity to the magnitude and rate of the generated muscle force. In the work of Hu et al. [22], the FD was calculated from filtered surface EMG signals in order to discriminate between forearm supination (FS) and forearm pronation (FP) movements. The authors reported that the FD values of filtered FS and filtered FP surface EMG signals fall into two distinct regions, demonstrating the usefulness of the FD in capturing different motion patterns of surface EMG signals. More recently, Phinyomark et al. [34] investigated the specific case of low-level EMG signal classification through a single-channel system, which is a difficult pattern classification task. The authors concluded that detrended fluctuation analysis (DFA), an advanced fractal analysis method suited to the identification of low-level muscle activations, performs better than other conventional features in the classification of EMG signals from bifunctional movements, such as flexion–extension. In a different vein, Ancillao et al. [3] conducted an experimental study investigating the correlation between the fractal dimension of the surface EMG signal recorded over the main extensor muscle of the human leg, viz. the rectus femoris muscle, during a vertical jump and the height reached in that jump. The authors concluded that the FD properly characterizes the EMG signal, and a linear regression analysis showed a very high correlation coefficient between the fractal dimension and the jump height achieved by the 20 healthy subjects recruited.

Regarding the classification stage, it can be briefly defined as the process of assigning one out of C discrete labels (classes) to a given input vector \(\varvec{x}\) [5]. The classification of EMG signals, in particular, is a hard pattern recognition task, since the EMG signal is usually contaminated by interference and fluctuations [21]. Numerous empirical studies have investigated the use of different types of classifiers operating on different types of features extracted from the EMG signal. These classifiers include artificial neural networks (ANN) [9], linear and quadratic discriminant analysis [6, 35], Bayesian classifiers [16], fuzzy classifiers [7], and support vector machines (SVM) [12, 28, 31, 45]. In a recent work [46], Yousefi and Hamilton-Wright conducted a critical review of classification methodologies used in EMG characterization and also presented the state-of-the-art accomplishments in this field, with emphasis on neuromuscular pathology.

Most of the aforementioned classifiers are based on the idea of solely minimizing the training error, usually called the empirical risk. However, the combination of limited amounts of training data and the quest for high classification accuracy over these data often leads to overfitting [5]. In addition, the accuracy levels exhibited by these classifiers are usually very sensitive to the feature dimension of the given pattern set. Since they do not suffer from these deficiencies, SVM appear as the method of choice for highly complex classification problems, such as those involving biomedical signals.

Relevance vector machines (RVM) were introduced by Tipping [42] as a Bayesian variant of SVM, which means that they also do not suffer from the aforementioned drawbacks. The RVM yields a probabilistic sparse model identical in functional form to the SVM, representing an approach to pattern classification that has recently attracted a great deal of interest. In many problems, RVM classifiers have produced results competitive with other kernel-based classifiers, and they have recently been thoroughly investigated in the context of electroencephalogram (EEG) signal classification for epilepsy diagnosis [29, 30].

The RVM formulation has recently been adapted to deal directly with multiclass classification problems [36]. A straightforward multiclass adaptation of the RVM is problematic because the maximization of the marginal likelihood scales badly with the number of classes [10] and with the dimensionality of the Hessian required for the Laplace approximation [5]. In [36], Psorakis et al. conceived an approach to circumvent these difficulties, bringing about two multiclass multikernel RVM methods (hereafter referred to as mRVM) that address multikernel learning while producing both sample-wise and kernel-wise sparse solutions.

In this paper, we investigate the conjoint use of RVM and FD for tackling the task of EMG signal classification. For this purpose, besides the standard RVM formulation, two types of mRVM, namely constructive mRVM and top-down mRVM, as well as different methods for calculating the FD of an EMG signal, were considered. As far as the authors are aware, this is the first work providing a thorough assessment of the potential of combining RVM and FD into a single EMG signal classification framework. Several experiments have been conducted on a dataset involving seven distinct types of limb motions, and the performance of distinct configurations of the RVM+FD approach is reported.

The rest of the paper is organized as follows. In Sects. 2 and 3, we present four methods for estimating the FD of a 1D signal and the mathematical formulations behind the RVM and mRVM models, respectively. In Sect. 4, we characterize the EMG dataset used in the experiments and outline the procedures adopted for data preprocessing. We then present and discuss the results achieved by different configurations of the RVM+FD approach, taking as reference the performance delivered by SVM models. Finally, Sect. 5 concludes the paper with remarks on future work.

2 Fractal dimension

In a nutshell, the fractal dimension is a statistical index of complexity, indicating how the details of a given physical pattern (or object) change with the scale at which they are measured [1, 4]. The value of this index is usually a non-integer, fractional number, hence the designation fractal dimension. There are many notions of FD, and various algorithms have been proposed to compute them [44]. None of these methods, however, should be considered universal, which justifies an empirical comparison of their abilities as feature extractors for EMG signals. In the following subsections, we outline the four methods adopted in our experiments.

2.1 Box-counting method

The idea behind the box-counting (BC) method is to apply successive hypercube grid coverings over a curve (e.g., a 1D signal), yielding a value that is usually very close to the Hausdorff dimension, another standard measure of the FD [4]. Since each iteration of the BC method applies a finer covering, the method is said to perform a progressively finer analysis of the fractal. The resulting FD measure is usually referred to as the box-counting dimension.

To compute the BC dimension, the successive coverings generated by the method are plotted on a log–log curve (a.k.a. the BC curve), whose points relate the shrinking of the hypercubes to their occupation rates. The straight line that best fits the BC curve captures the behavior of the observations from the signal under analysis, and its slope (i.e., the exponent of the underlying power law) gives the BC dimension of the fractal.

Formally speaking, the calculation of the BC dimension (D) is given by [4]:

$$\begin{aligned} D = \lim _{n \rightarrow \infty }\frac{\log (N_n(\varLambda ))}{\log (2^n)}, \end{aligned}$$

where \(\varLambda \in {\mathfrak{H}}({\mathfrak{R}}^m)\) is an attractor in the Euclidean metric space whose points are compact subsets of \({\mathfrak{R}}^m\); \(N_n(\varLambda )\) is the number of boxes intersecting the attractor; and n denotes the nth iteration of the process. Simply put, the BC method covers \({\mathfrak{R}}^m\) with a grid of boxes of lateral size \(1/2^n\).
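For illustration, a minimal NumPy sketch of this covering scheme for a 1D signal is given below. Rescaling the signal graph into the unit square, the number of iterations, and the least-squares fit of the log–log points are our own implementation choices, and only boxes containing sampled points are counted (an approximation of the continuous curve).

```python
import numpy as np

def box_counting_fd(signal, max_iter=8):
    """Hedged sketch of the box-counting dimension of a 1D signal.

    At iteration n the unit square is covered with boxes of side 1/2**n;
    the boxes hit by the (sampled) curve are counted, and the FD is the
    slope of log N_n versus log 2**n.
    """
    signal = np.asarray(signal, dtype=float)
    x = np.linspace(0.0, 1.0, len(signal))
    y = (signal - signal.min()) / (signal.max() - signal.min() + 1e-12)

    log_counts, log_scales = [], []
    for n in range(1, max_iter + 1):
        n_boxes = 2 ** n
        # Grid cell (col, row) that each sample falls into.
        cols = np.minimum((x * n_boxes).astype(int), n_boxes - 1)
        rows = np.minimum((y * n_boxes).astype(int), n_boxes - 1)
        occupied = len(set(zip(cols.tolist(), rows.tolist())))
        log_counts.append(np.log(occupied))
        log_scales.append(np.log(n_boxes))

    slope, _ = np.polyfit(log_scales, log_counts, 1)   # slope of the log-log fit
    return slope
```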

2.2 Higuchi’s method

Like the former, Higuchi's method [19, 44] is iterative in nature; however, it is especially suited to handling waveforms. Consider \(s=\{s(1),s(2),\ldots ,s(N)\}\) as an epoch of the time series to be analyzed. Then, construct k new time series (a.k.a. sub-epochs) \(s_m^k\), each defined as [44]

$$\begin{aligned} s_m^k= \left\{ s(m),s(m+k),s(m+2k),\ldots ,s \left( m+\left\lfloor \frac{(N-m)}{k} \right\rfloor k \right) \right\} , \end{aligned}$$

where N is the total length of the data sequence s; \(m=1,2,3,\ldots ,k\) indicates the initial time value; k indicates the discrete time interval between points (delay); and \(\lfloor \cdot \rfloor\) means the floor operator.

For each of the sub-epochs \(s_m^k\), the average length \(L_m(k)\) is computed as

$$\begin{aligned} L_m(k)= \frac{1}{k}\left\{ \frac{(N-1)}{\left\lfloor \frac{(N-m)}{k} \right\rfloor k}\sum _{i=1}^{\left\lfloor \frac{(N-m)}{k} \right\rfloor } \left| s(m+ik)-s(m+(i-1)k) \right| \right\} , \end{aligned}$$

where \((N-1)/\lfloor (N-m)/k \rfloor k\) is a normalization factor.

Then, the length of the epoch L(k) for the time interval k is computed as the mean of the k values \(L_m(k)\), \(m=1,2,\ldots ,k\), as given in Eq. (1). This procedure is repeated for each k, ranging from 1 to \(k_\mathrm{max}\) (\(k_\mathrm{max}=5\) in our experiments).

$$\begin{aligned} L(k)= \frac{1}{k}\sum _{m=1}^k L_m(k). \end{aligned}$$
(1)

The total average length L(k), for scale k, is proportional to \(k^{-D}\), where D is the FD of the curve describing the shape of the epoch as calculated by Higuchi's method. Equivalently, if L(k) is plotted against k on a double-logarithmic scale, the magnitude of the slope of the linear regression of this plot can be taken as an estimate of the FD of the epoch [19].
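A minimal sketch of this procedure, assuming \(k_\mathrm{max}=5\) as in our experiments and a least-squares fit of the log–log points, could look as follows; the 0-based indexing and the sign convention for the slope are our own choices.

```python
import numpy as np

def higuchi_fd(s, k_max=5):
    """Hedged sketch of Higuchi's method.

    For each delay k the average curve length L(k) is computed over the k
    sub-epochs; the FD is the magnitude of the slope of log L(k) vs log k.
    """
    s = np.asarray(s, dtype=float)
    N = len(s)
    log_k, log_L = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(1, k + 1):
            n_steps = (N - m) // k
            if n_steps < 1:
                continue
            idx = m - 1 + np.arange(n_steps + 1) * k       # s(m), s(m+k), ..., 0-based
            diffs = np.abs(np.diff(s[idx])).sum()
            norm = (N - 1) / (n_steps * k)                  # normalization factor
            lengths.append(diffs * norm / k)                # L_m(k)
        log_k.append(np.log(k))
        log_L.append(np.log(np.mean(lengths)))              # L(k) as the mean of L_m(k)
    slope, _ = np.polyfit(log_k, log_L, 1)
    return -slope                                            # L(k) ~ k**(-D)
```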

2.3 Katz’s method

Consider \(s(i)=(x_i,y_i )\), \(i=1,2,\ldots ,N\), where \(x_i\) are values of the abscissa and \(y_i\) are values of the ordinate. If the points s(i) and s(j) are represented as \((x_i,y_i)\) and \((x_j,y_j)\), respectively, the Euclidean distance between the points is computed as:

$$\begin{aligned} {\text{dist}}(s(i),s(j))=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}. \end{aligned}$$

According to the Katz’s method, the FD of the curve representing a time series can be defined as [27]:

$$\begin{aligned} D = \frac{\log (L)}{\log (d)}, \end{aligned}$$
(2)

where L is the total length of the curve or the sum of the Euclidean distances between successive points in the same curve, and d is the diameter estimated as

$$\begin{aligned} d = \max ({\text{dist}}(s(i),s(j))),\quad i,j=1,\ldots ,N. \end{aligned}$$

If the curve does not intersect itself, i can be fixed at 1, so that d is estimated as the maximum distance between the first sample and the farthest of all subsequent samples \(s(i), i = 2,\ldots ,N\).

Obviously, d and L must be dimensionless numbers for the logarithms in Eq. (2) to be computed, which is not always the case. Katz [27] proposed normalizing d and L by the length of the average step, defined as \(L/N_l\). In this way, Eq. (2) becomes

$$\begin{aligned} D = \frac{\log (N_l)}{\log (N_l)+\log (d/L)}, \end{aligned}$$
(3)

where \(N_l = N-1\).
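As an illustration, a short sketch of Katz's estimator in Eq. (3) is given below; taking the sample index as the abscissa is our own assumption.

```python
import numpy as np

def katz_fd(y):
    """Hedged sketch of Katz's estimator, Eq. (3):
    D = log(N_l) / (log(N_l) + log(d / L))."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)                 # abscissa taken as sample index
    dx, dy = np.diff(x), np.diff(y)
    L = np.sqrt(dx ** 2 + dy ** 2).sum()               # total curve length
    d = np.sqrt((x - x[0]) ** 2 + (y - y[0]) ** 2)[1:].max()   # planar extent (diameter)
    n_l = len(y) - 1                                   # number of steps
    return np.log(n_l) / (np.log(n_l) + np.log(d / L))
```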

2.4 Sevcik’s method

Let \(y_i, i = 1, \ldots , N\) be a set of values sampled from a signal between time zero and \(t_\mathrm{max}\) with sampling period \(\delta\). Suppose also that the waveform is submitted to a double-linear transformation that maps it into a unit square. Then, the normalized abscissa \(x_i^*\) and the normalized ordinate \(y_i^*\) of the square can be defined, respectively, as [40]

$$\begin{aligned} x_i^*= & {} \frac{x_i}{x_\mathrm{max}},\\ y_i^*= & {} \frac{y_i-y_\mathrm{min}}{y_\mathrm{max}-y_\mathrm{min}}, \end{aligned}$$

where \(x_\mathrm{max}\) (\(y_\mathrm{max}\)) denotes the maximum value of \(x_i\) (\(y_i\)), and \(y_\mathrm{min}\) is the minimum value of \(y_i\). Thus, the FD of the waveform can be approximated by [40]

$$\begin{aligned} D = 1+ \frac{\ln (L)}{\ln (2N_l)} \end{aligned}$$

where \(\ln\) is the natural logarithm, L is the length of the curve in the unit square and \(N_l = N-1\).
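A corresponding sketch of Sevcik's approximation, under the same assumption that the abscissa is the (normalized) sample index, could be:

```python
import numpy as np

def sevcik_fd(y):
    """Hedged sketch of Sevcik's estimator: the waveform is mapped into the
    unit square and D is approximated as 1 + ln(L) / ln(2 * N_l)."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    x_star = np.arange(N) / (N - 1)                            # x_i / x_max
    y_star = (y - y.min()) / (y.max() - y.min() + 1e-12)       # (y_i - y_min) / (y_max - y_min)
    L = np.sqrt(np.diff(x_star) ** 2 + np.diff(y_star) ** 2).sum()   # length in the unit square
    return 1.0 + np.log(L) / np.log(2 * (N - 1))
```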

3 Relevance vector machines and their multiclass versions

As mentioned before, RVM can be regarded as a Bayesian variant of SVM, aimed at overcoming some of the SVM limitations [5, 30, 42]. In this section, we present the basic formulation underlying standard RVM classifiers and also the recently proposed multiclass versions [18, 36].

3.1 Relevance vector machines

The standard formulation of the RVM assumes, for a given input \(\varvec{x}_n\), that the error between the classifier output, given by \(f(\varvec{x}_n;\varvec{w})\), and the desired output \(t_n\), where \(t_n \in \left\{ 0,1\right\}\), has a normal distribution with zero mean and variance \(\sigma ^2\). It also assumes that the samples \(\{\varvec{x}_i, t_i \}^N_{i=1}\) are independently generated, so that the likelihood of the observed dataset can be written as [42]:

$$\begin{aligned} p(\varvec{t} | \varvec{w},\sigma ^2 )=(2 \pi \sigma ^2)^{-N/2}\exp \left\{ -\frac{1}{2 \sigma ^2} || \varvec{t} - \varvec{\varPhi }\varvec{w}||^2 \right\} , \end{aligned}$$

where \(\varvec{t}=[t_1,\ldots ,t_N]^T\), \(\varvec{w}=[w_0,\ldots ,w_N]^T\), and \(\varvec{\varPhi } = [\varvec{\phi }(\varvec{x}_1), \ldots , \varvec{\phi }(\varvec{x}_N)]^T\), with \(\varvec{\phi }(\varvec{x}_i)=[1, K(\varvec{x}_i,\varvec{x}_1), \ldots , K(\varvec{x}_i,\varvec{x}_N)]^T\). The function \(K(\cdot ,\cdot )\) denotes a kernel function defined on a (high-dimensional) dot product space [39], whereas the final decision function is given by \(f(\varvec{x}_n;\varvec{w}) = w_0 + \sum _{i=1}^N w_iK(\varvec{x}_n,\varvec{x}_i)\).

The RVM places an a priori probability over the model parameters (weights) controlled by a set of hyper-parameters. Each weight becomes associated with a hyper-parameter, and the most likely values for the weights are estimated iteratively from the training data [42]. In a Bayesian perspective, the model parameters \(\varvec{w}\) and \(\sigma ^2\) can be estimated initially from an a priori distribution and then reestimated by calculating a posterior distribution using the observed data likelihood. Tipping [42] proposed the following a priori distribution for each model parameter:

$$\begin{aligned} p(w_j | \alpha _j ,\sigma ^2 )= \sqrt{\frac{\alpha _j}{2 \pi }} \exp \left\{ - \frac{\alpha _j w^2_j}{2} \right\} = {\mathcal {N}}(0,\alpha _j^{-1}), \end{aligned}$$

where \(j=0,\ldots ,N\) and \(\varvec{\alpha }=[\alpha _0,\ldots ,\alpha _N]^T\) is the hyper-parameter vector, which is estimated iteratively from the training data.

Given an a priori distribution, the Bayes rule can be used to determine the posterior distribution of the model parameters through \(p(\varvec{w},\varvec{\alpha },\sigma ^2| \varvec{t}) = p(\varvec{w} | \varvec{t}, \varvec{\alpha },\sigma ^2)p(\varvec{\alpha },\sigma ^2 | \varvec{t})\).

Moreover, for a new sample \(\varvec{x}_n\), the prediction of the corresponding label \(t_n\) can be provided by

$$\begin{aligned} p(t_n | \varvec{t}) = \int p(t_n | \varvec{w}, \varvec{\alpha },\sigma ^2)p(\varvec{w},\varvec{\alpha },\sigma ^2 | \varvec{t})d\varvec{w} d\varvec{\alpha } d\sigma ^2. \end{aligned}$$

However, an analytical expression for the posterior distribution of the model parameters is still not available. In order to solve this problem, it is necessary to adopt an effective approximation. The posterior distribution of the parameters can be decomposed into two components according to

$$\begin{aligned} p(\varvec{w},\varvec{\alpha },\sigma ^2 | \varvec{t}) = p(\varvec{w} | \varvec{t}, \varvec{\alpha },\sigma ^2)p(\varvec{\alpha },\sigma ^2 | \varvec{t}) . \end{aligned}$$
(4)

The first term of the right-hand side of Eq. (4) is the posterior probability of the weights \(\varvec{w}\) given \(\sigma ^2\) and \(\varvec{\alpha }\). The computation of these probabilities is well detailed in [42].

Once the weights have been obtained, the hyper-parameters \(\alpha _i\) are updated according to \(\alpha _i = \frac{\lambda _i}{w^2_i}\), where \(w^2_i\) is the square of the posterior mean of the ith weight, \(\lambda _i\) is defined as \(\lambda _i = 1 - \alpha _i \varSigma _{ii}\), and \(\varSigma _{ii}\) is the ith element of the main diagonal of the posterior covariance matrix, which may be interpreted as a measure of how well the parameter \(w_i\) is determined by the data. The optimization of the hyper-parameters continues until a pre-defined threshold is achieved or until a certain number of iterations has been performed.
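To make the re-estimation loop concrete, the sketch below assumes the Gaussian-likelihood model stated above and treats the binary targets as real-valued outputs, a common simplification; the full classification case replaces the closed-form posterior with a Laplace approximation [42]. The variable names and the pruning threshold are our own choices.

```python
import numpy as np

def rvm_reestimate(Phi, t, n_iter=100, sigma2=0.1, alpha_cap=1e9):
    """Hedged sketch of the iterative hyper-parameter re-estimation.

    Phi: (n_samples, n_basis) design matrix; t: target vector.
    Returns the posterior mean weights, the hyper-parameters, and a mask
    of the basis functions retained as relevance vectors.
    """
    n_basis = Phi.shape[1]
    alpha = np.ones(n_basis)
    for _ in range(n_iter):
        # Posterior over the weights given the current hyper-parameters.
        Sigma = np.linalg.inv(np.diag(alpha) + Phi.T @ Phi / sigma2)
        mu = Sigma @ Phi.T @ t / sigma2
        # lambda_i measures how well each weight is determined by the data.
        lam = 1.0 - alpha * np.diag(Sigma)
        alpha = lam / (mu ** 2 + 1e-12)
        alpha = np.minimum(alpha, alpha_cap)   # alphas driven toward "infinity" prune basis functions
    relevant = alpha < alpha_cap
    return mu, alpha, relevant
```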

Sparsity emerges when most of the \(\alpha _i\) go to infinity, effectively removing the corresponding basis functions; the remaining basis functions are called the relevance vectors (RV) [42]. For large-scale problems, the number of RV can still be high and the testing complexity may become prohibitive, namely \(O(N_\mathrm{ts}N_\mathrm{RV})\), where \(N_\mathrm{ts}\) is the number of samples in the test set and \(N_\mathrm{RV}\) is the number of relevance vectors.

Standard RVM models can handle classification problems with multiple classes by decomposing the problem into several binary classification tasks, each solved by a separate RVM model. The simplest scheme, known as the one-versus-one approach, decomposes a problem with C classes into \(\frac{C(C-1)}{2}\) binary problems. A binary classifier is built to discriminate between each pair of classes, discarding the samples of the remaining classes. When a new sample is tested, a vote is taken among the classifiers and the class receiving the most votes is deemed the outcome (see the sketch below).
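A minimal sketch of this voting scheme is given below; the `binary_models` mapping and its `predict` interface are hypothetical and stand for any fitted binary RVM implementation.

```python
import numpy as np
from itertools import combinations

def ovo_predict(binary_models, X, n_classes):
    """Hedged sketch of one-versus-one voting with C(C-1)/2 binary classifiers.

    binary_models: dict mapping a class pair (a, b) to a fitted binary
    classifier whose .predict(X) returns 0 (class a) or 1 (class b).
    """
    X = np.asarray(X)
    votes = np.zeros((len(X), n_classes), dtype=int)
    for (a, b) in combinations(range(n_classes), 2):
        pred = np.asarray(binary_models[(a, b)].predict(X))
        winners = np.where(pred == 0, a, b)          # 0 -> class a, 1 -> class b
        votes[np.arange(len(X)), winners] += 1
    return votes.argmax(axis=1)                      # class with most votes wins
```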

3.2 Multiclass relevance vector machines

Two different types of mRVM were proposed in [18, 36], namely the constructive type (referred to as mRVM1) and the top-down type (mRVM2). The idea behind both is not to train multiple RVM classifiers but to train a single model that deals directly with multiclass problems. While mRVM1 achieves sparsity by starting with an empty model and adding training samples based on their contribution to the model, mRVM2 follows a top-down strategy, loading the whole training kernel into memory and iteratively removing non-relevant samples.

The training phase of mRVM2 is similar to that of mRVM1, both being based on the expectation maximization (EM) algorithm. The main difference is that mRVM2 does not adopt the marginal likelihood maximization as mRVM1 does [see Eq. (5)] but rather employs an extra E-step for the updates of the hyper-parameters [18]. Moreover, mRVM2 is relatively more expensive than mRVM1 because each sample i has a different scale \(\alpha _{ic}\) for each class. However, once mRVM2 prunes a sample, that sample cannot be reintroduced into the model. In what follows, we present the main equations underlying the formulation of mRVM1. The reader is referred to [36] for more detailed explanations.

Consider a training set \(\left\{ \varvec{x}_n,t_n\right\} _{n=1}^N\), where \(\varvec{x}_n \in {\mathfrak{R}}^m\) and \(t_n \in \left\{ 1,\ldots ,C\right\}\). Let \(\varvec{k}_n\) be the nth row of the kernel matrix \(\varvec{K}\) (\(\varvec{K} \in {\mathfrak{R}}^{N \times N}\)), expressing how the nth sample correlates with the others from the training set. The learning process involves the inference of the model parameters \(\varvec{W} \in {\mathfrak{R}}^{N\times C}\) in such a way that the quantity \(\varvec{W}^T\varvec{K}\) acts as a sort of voting system expressing which data relationships are important to capture for increasing the model’s discriminative properties.

Moreover, let \(\varvec{Y} = \left\{ y_{11},\ldots ,y_{1N};\ldots ;y_{c1},\ldots ,y_{cN};\ldots ;y_{C1},\ldots ,y_{CN}\right\} \in {\mathfrak{R}}^{C \times N}\) denote a matrix of auxiliary variables introduced for the purpose of multiple class discrimination, acting as targets for \(\varvec{W}^T\varvec{K}\). The variables \(y_{cn}\) are assumed to obey a standardized noise model, i.e., \(y_{cn} | \varvec{w}_c\), \(\varvec{k}_n \sim {\mathcal {N}}_{y_{cn}}(\varvec{w}_c^T\varvec{k}_n,1)\), whereas the model parameters \(w_{nc}\) follow a standard zero-mean Gaussian distribution, namely \(w_{nc} \sim {\mathcal {N}}(0,1/ \alpha _{nc})\), where \(\alpha _{nc}\) belongs to the scaling matrix \(\varvec{A} = (\varvec{\alpha }_1, \ldots , \varvec{\alpha }_N)^T \in {\mathfrak{R}}^{N\times C}\).

The formulation of mRVM1 adopts as objective the maximization of the marginal likelihood \(p(\varvec{Y} | \varvec{K}, \varvec{A} ) = \int p(\varvec{Y} | \varvec{K}, \varvec{W})p(\varvec{W} | \varvec{A}) d\varvec{W}\). In order to differentiate this likelihood, Psorakis et al. [36] followed the assumption that each sample n has a common scale \(\alpha _n\) shared across all classes. So, for mRVM1, the vector of hyper-parameters \(\varvec{\alpha }_n\) associated with a sample turns out to be a simple scalar \(\alpha _n\). The maximization of the marginal likelihood results in a criterion to either add a sample n, delete it, or update its associated \(\alpha _n\). So, the model can start with a single sample and then proceed in a constructive manner.

In order to achieve this goal, the log of the marginal likelihood is decomposed into contributing terms based on each sample, that is,

$$\begin{aligned} {\mathfrak{L}}(\varvec{A})= & {} \log p(\varvec{Y} | \varvec{K}, \varvec{A})\nonumber \\= & {} \sum _{c=1}^C -\frac{1}{2} \left[ N \log 2 \pi + \log |\varvec{{\mathcal {C}}}| + \varvec{y}_c^T \varvec{{\mathcal {C}}}^{-1}\varvec{y}_c\right] , \end{aligned}$$
(5)

where \(\varvec{{\mathcal {C}}} = \varvec{I} + \varvec{K}\varvec{A}^{-1}\varvec{K}^T\), whose determinant and inverse were derived by Tipping and Faul [43] as a function of \(\varvec{{\mathcal {C}}}_{-i}\), that is, the value of \(\varvec{{\mathcal {C}}}\) with the ith sample removed. The determinant of \(\varvec{{\mathcal {C}}}\) is given by

$$\begin{aligned} | \varvec{{\mathcal {C}}} | = |\varvec{{\mathcal {C}}}_{-i}| | 1 + \alpha _i^{-1}\varvec{k}_i^T \varvec{{\mathcal {C}}}_{-i}^{-1} \varvec{k}_i |, \end{aligned}$$

whereas the inverse of \(\varvec{{\mathcal {C}}}\) is given by

$$\begin{aligned} \varvec{{\mathcal {C}}}^{-1} = \varvec{{\mathcal {C}}}_{-i}^{-1} - \frac{\varvec{{\mathcal {C}}}_{-i}^{-1} \varvec{k}_i \varvec{k}_i^T \varvec{{\mathcal {C}}}_{-i}^{-1}}{\alpha _i + \varvec{k}_i^T \varvec{{\mathcal {C}}}_{-i}^{-1} \varvec{k}_i}. \end{aligned}$$
(6)

Equipped with these results, Eq. (5) can be rewritten as:

$$\begin{aligned} {\mathfrak{L}}(\varvec{A}) = {\mathfrak{L}}(\varvec{A}_{-i}) + \sum _{c=1}^C -\frac{1}{2} \left[ \log \alpha _{i} - \log (\alpha _i + s_i) + \frac{q^2_{ci}}{\alpha _i + s_i}\right] , \end{aligned}$$

where \(s_i\) and \(q_{ci}\) are called the sparsity factor and the quality factor, respectively, defined as \(s_i = \varvec{k}^T_i \varvec{{\mathcal {C}}}_{-i}^{-1} \varvec{k}_i\) and \(q_{ci} = \varvec{k}^T_i \varvec{{\mathcal {C}}}_{-i}^{-1} \varvec{y}_c\). The sparsity factor can be seen as a measure of how much of the descriptive information of the ith sample is already captured by the existing samples, whereas the quality factor measures how well the ith sample helps to describe a specific class [36].

By setting the derivative \(\partial {\mathfrak{L}}(\varvec{A})/ \partial \alpha _i = 0\), one obtains

$$\begin{aligned} \alpha _i&= \frac{Cs^2_i}{\sum ^C_{c=1}q_{ci}^2-Cs_i}, \quad \text{ if } \sum \nolimits ^C_{c=1}q_{ci}^2>Cs_i \end{aligned}$$
(7a)
$$\begin{aligned} \alpha _i&= \infty , \quad \text{ if } \sum \nolimits ^C_{c=1}q_{ci}^2\le Cs_i. \end{aligned}$$
(7b)

The quantity \(\theta _i = \sum ^C_{c=1}q_{ci}^2-Cs_i\) captures the contribution of the ith sample to the marginal likelihood in terms of how much additional descriptive information it provides to the model. By resorting to this quantity, it is possible to establish rules for including or excluding a given sample, or updating its hyper-parameter (a code sketch of these rules follows the list) [18]:

  • IF \(\theta _i>0\) and \(\alpha _i<\infty\) THEN set/update \(\alpha _i\) with (7a);

  • IF \(\theta _i\le 0\) and \(\alpha _i<\infty\) THEN set \(\alpha _i\) with (7b).
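The following sketch merely restates these rules in code form, assuming that \(s_i\) and the vector of \(q_{ci}\) values have already been computed:

```python
import numpy as np

def update_alpha_i(s_i, q_ci):
    """Hedged sketch of the inclusion/exclusion rule above (Eqs. 7a/7b)."""
    q_ci = np.asarray(q_ci, dtype=float)
    C = len(q_ci)
    quality = np.sum(q_ci ** 2)
    if quality > C * s_i:                              # Eq. (7a): keep/update the sample
        return C * s_i ** 2 / (quality - C * s_i)
    return np.inf                                      # Eq. (7b): prune the sample
```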

Then, the M-step and E-step of EM are used to estimate \(\varvec{W}\) and the posterior expectations of the auxiliary variables \(\varvec{Y}\), respectively. The weights are estimated as:

$$\begin{aligned} \hat{\varvec{w}}_c = (\varvec{K}\varvec{K}^T + \varvec{A}_c)^{-1}\varvec{K}\tilde{\varvec{y}}_c^T. \end{aligned}$$

Assuming a given class i, the E-step calculates the expected value of \(y_{in}\) as

$$\begin{aligned} \tilde{y}_{in} = \hat{\varvec{w}}_i^T \varvec{k}_n - \sum _{j\ne i} \left( \tilde{y}_{jn} - \hat{\varvec{w}}_j^T \varvec{k}_n \right) , \end{aligned}$$

whereas \(\forall c \ne i\), the E-step yields

$$\begin{aligned} \tilde{y}_{cn} \leftarrow \hat{\varvec{w}}_c^T\varvec{k}_n - \frac{{\mathcal {E}}_{p(u)}\left\{ {\mathcal {N}}_u(\hat{\varvec{w}}_c^T\varvec{k}_n - \hat{\varvec{w}}_i^T\varvec{k}_n,1)\varPhi _u^{n,i,c}\right\} }{{\mathcal {E}}_{p(u)}\left\{ \varvec{\varPhi }_u(u+\hat{\varvec{w}}_i^T\varvec{k}_n - \hat{\varvec{w}}_c^T\varvec{k}_n)\varPhi _u^{n,i,c}\right\} }, \end{aligned}$$

where \(u \sim {\mathcal {N}}(0,1)\) and \(\varvec{\varPhi }\) denotes the Gaussian cumulative distribution function.

In the classification phase, the test sample \(\varvec{x}_n\) is assigned to the class i whose auxiliary variable \(y_{in}\), \(1 \le i \le C\), is maximum, i.e., \(t_n = \arg \max _i (y_{in})\).

4 Computational experiments

In what follows, we provide details about the dataset used in the experiments and how the experiments were set up. We then present the accuracy results achieved by the RVM and mRVM models, considering the different methods for calculating the fractal dimension. For each model, we also report the optimized kernel parameter value and the associated number of relevance vectors, so as to measure the complexity of the induced models. In this paper, the one-versus-one approach was adopted when using the standard RVM.

4.1 Description of the dataset

The EMG signal dataset used in our experiments was originally collected by Chan and collaborators [6, 17]. The authors used eight channels of surface EMG to collect signals from the right arm of 30 normally limbed subjects. Each subject underwent four sessions, with one to two days of separation between sessions. Each session consisted of six trials. EMG signals were collected from seven sites on the forearm and one site on the biceps. An electrode was placed on the wrist to provide a common ground reference. These signals were amplified with a gain of 1000 and a bandwidth of 1 Hz to 1 kHz. Signals were sampled at 3 kHz using an analog-to-digital converter.

Seven distinct limb motions (classes) were performed: hand open, hand close, supination, pronation, wrist flexion, wrist extension, and rest. In each trial, the subject repeated each limb motion four times, holding each motion for 3 s each time. The order of these limb motions was randomized. Chan and Green [6] used only session four in their experiments: data from the first two trials were used as training data, and data from the remaining four trials were used as testing data. In this paper, we also make use of data from session four, but the investigated models were assessed separately on each trial using \(5\times 2\) cross-validation.

4.2 Experimental setup

The main purpose of this paper is to empirically assess the performance of RVM models in the task of EMG signal classification. In the experiments, we considered only the radial basis function kernel [39], which has an associated hyper-parameter, the radius \(\sigma\), to be calibrated beforehand. Although several heuristics exist for selecting hyper-parameter values, we opted to choose \(\sigma\) from the set \(\{2^i, i = -3, -2, -1, 0, 1, 2, 3, 4, 5\}\). For each of the nine values in this set, a \(5\times 2\)-fold cross-validation run per trial was performed in order to measure the average performance of the methods.

In what concerns data preprocessing, samples were extracted from the EMG signals using a sliding window of 256 ms in length, spaced 32 ms apart [15]. Then, the FD values, calculated by the different methods described in Sect. 2, were used to build the feature vectors. Each transformed sample (i.e., feature vector) comprised eight features, since there were eight channels and one FD value was calculated per channel. The class distribution for each trial is presented in Table 1.
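A sketch of this windowing scheme is shown below, assuming the raw recording is available as an array with one column per channel and converting the window and step lengths into samples at the 3 kHz sampling rate; the function name and interface are our own.

```python
import numpy as np

def extract_fd_features(emg, fd_fun, fs=3000, win_ms=256, step_ms=32):
    """Hedged sketch of the sliding-window FD feature extraction.

    emg:    (n_samples, 8) array holding the eight EMG channels.
    fd_fun: any FD estimator taking a 1D segment, e.g., the katz_fd sketch
            from Sect. 2.3.
    Each window is mapped onto an 8-dimensional feature vector.
    """
    win = int(fs * win_ms / 1000)      # 768 samples at 3 kHz
    step = int(fs * step_ms / 1000)    # 96 samples at 3 kHz
    features = []
    for start in range(0, emg.shape[0] - win + 1, step):
        segment = emg[start:start + win, :]
        features.append([fd_fun(segment[:, ch]) for ch in range(segment.shape[1])])
    return np.asarray(features)

# Usage (hypothetical): feats = extract_fd_features(raw_emg, fd_fun=katz_fd)
```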

Table 1 Class distribution for each trial
Table 2 Best results in terms of cross-validation error achieved for each feature type and kernel machine

4.3 Simulation results

In Table 2, we report the best accuracy results achieved by the different kernel machines, including SVM, considering the four types of FD features. The results are given in terms of the average and standard deviation of the generalization error computed over the \(5\times 2\)-fold cross-validation process. In this table, we highlight the best calibrated kernel parameter value for each kernel machine and also present the number of relevance vectors or support vectors associated with each model. The accuracy results are complemented by those reported in Table 3, which relates to the application of the two-sided Wilcoxon rank sum test to the cross-validation errors [11]. The Wilcoxon rank sum test is a nonparametric statistical procedure that helps answer the following question: do two independent samples, say \(\mathbf {x}\) and \(\mathbf {y}\), represent two different populations? The null hypothesis is that the data in \(\mathbf {x}\) and \(\mathbf {y}\) are samples from continuous distributions with equal medians. Assuming a 5% significance level, a p-value lower than 0.05 indicates that the test rejects the null hypothesis, and thus that the difference in performance between the given kernel machines is statistically significant [20]. In our case, the test is applied per trial, and one of the samples always corresponds to the kernel machine with the lowest average cross-validation error for the given trial.
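For reference, a minimal sketch of how such a per-trial comparison could be carried out with SciPy is given below; the `cv_errors` structure is hypothetical and merely stands for the ten cross-validation error estimates collected for each kernel machine in one trial.

```python
from scipy.stats import ranksums

def compare_to_best(cv_errors, significance=0.05):
    """Hedged sketch: two-sided Wilcoxon rank sum test of every machine
    against the one with the lowest average cross-validation error.

    cv_errors: dict mapping a machine name to its list of 5x2 CV errors.
    """
    best = min(cv_errors, key=lambda m: sum(cv_errors[m]) / len(cv_errors[m]))
    results = {}
    for machine, errs in cv_errors.items():
        if machine == best:
            continue
        _, p = ranksums(cv_errors[best], errs)          # two-sided rank sum test
        results[machine] = (p, p < significance)        # significant if p < 0.05
    return best, results
```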

On the other hand, Tables 4 and 5 show the specificity and sensitivity values delivered by the best calibrated kernel machines, as reported in Table 2, for each combination of FD method and trial. Each of the last 14 columns in these tables refers to either a specificity or a sensitivity result for a certain class. Sensitivity (also called the true positive rate) measures the proportion of actual positives of a class which are correctly identified as such, whereas specificity (aka the true negative rate) measures the proportion of negatives of a class which are correctly identified as such.

The features were normalized to zero mean and unit standard deviation. Since the accuracy results produced with the feature values extracted by Katz's method were significantly better than those obtained with the other FD methods, we decided to inspect in more detail the effect of the kernel parameter calibration for the cases where Katz's method was employed. Thus, Figs. 1, 2, 3 and 4 show how the accuracy rate (i.e., one minus the error rate) obtained by the different kernel machines varied as a function of the kernel parameter value. Figures 5, 6, 7, and 8 do the same for the sensitivity. The choice of trial #1 was arbitrary, since the purpose here is only to contrast the profiles produced by the different machines. The bars in Figs. 1, 2, 3, and 4 represent the variance in accuracy rate per class (one standard deviation from the mean) for each value of \(\sigma\) considered.

Finally, in Tables 6 and 7, we provide the average processing time elapsed during the training and testing phases for each combination of classifier model, fractal dimension estimation method, and experimental trial.

4.4 Discussion

From the results presented in Tables 2 and 3, it is possible to conclude that, in general, the accuracy rates displayed by SVM and RVM were rather similar to each other, prevailing in the majority of the cases over those produced by mRVM2. The performance of mRVM1, on the other hand, varies with the feature extractor adopted. Considering specifically the BC and Sevcik's methods, SVM and RVM usually outperformed the others, as evidenced by the low p-values associated with mRVM1 and mRVM2. For Higuchi's method, SVM performed consistently better than mRVM1 and mRVM2, but was comparable to RVM in most cases. Irrespective of the type of kernel machine, the accuracy rates obtained with the aforementioned FD methods were significantly worse than those achieved with Katz's method. For this feature type, the performance levels delivered by SVM, RVM, and mRVM1 were rather comparable, since the null hypothesis could not be rejected in five out of six trials. In half of the trials, mRVM1 provided the best average results, whereas in all cases mRVM2 was outperformed by the best kernel machine. It is also worth mentioning that the standard deviations of the error rates obtained with Katz's method were usually smaller for all machines, evidencing the robustness of the induced models to the variability of training/test data in the cross-validation process.

In what concerns the efficiency of FD-based RVM and its variations in terms of computational time, the results shown in Tables 6 and 7 reveal that the training of these models is usually more expensive than that of SVM. In the testing phase, however, the average time dropped from 2 s for SVM to circa 1.5 s for RVM and to about 0.4 s for mRVM1 and mRVM2. This suggests that the FD-based multiclass RVM can yield sparser solutions, which means a better data reduction ability. In any case, regardless of the FD estimation technique used, the time taken to obtain the final classification outputs from the induced RVM models is usually small, which supports their practical deployment in real-world settings.

Table 3 Results of the Wilcoxon rank sum test over the cross-validation errors
Fig. 1
figure 1

Variation of accuracy rate per class as a function of the kernel parameter value for SVM using trial #1 and feature vector extracted through the Katz’s method

Fig. 2
figure 2

Variation of accuracy rate per class as a function of the kernel parameter value for RVM using trial #1 and feature vector extracted through the Katz’s method

Fig. 3
figure 3

Variation of accuracy rate per class as a function of the kernel parameter value for mRVM1 using trial #1 and feature vector extracted through the Katz’s method

Fig. 4
figure 4

Variation of accuracy rate per class as a function of the kernel parameter value for mRVM2 using trial #1 and feature vector extracted through the Katz’s method

Fig. 5
figure 5

Variation of sensitivity per class as a function of the kernel parameter for SVM using trial #1 and feature vector extracted through the Katz’s method

Fig. 6
figure 6

Variation of sensitivity per class as a function of the kernel parameter for RVM using trial #1 and feature vector extracted through the Katz’s method

Fig. 7
figure 7

Variation of sensitivity per class as a function of the kernel parameter for mRVM1 using trial #1 and feature vector extracted through the Katz’s method

Fig. 8
figure 8

Variation of sensitivity per class as a function of the kernel parameter for mRVM2 using trial #1 and feature vector extracted through the Katz’s method

Table 4 Best specificity (Spec) results achieved by models for each class and FD method using trial #1
Table 5 Best sensitivity (Sens) results achieved by models for each class and FD method using trial #1
Table 6 Average CPU time (in seconds) spent in the training phase for each combination of FD estimation method, classifier type, and experimental trial
Table 7 Average CPU time (in seconds) spent in the test phase for each combination of FD estimation method, classifier type, and experimental trial

By looking at the values shown in Tables 4 and 5, one can perceive that the use of Katz's method as feature extractor endowed all classifiers with a good balance between specificity and sensitivity across the classes. In fact, for all FD methods but Katz's, the specificity values were usually significantly lower than the sensitivity values. Besides, as evidenced in Figs. 5, 6, 7, and 8, very high sensitivity values could be obtained for all seven classes, irrespective of the value used for the kernel parameter. This behavior could not be reproduced by the other FD methods.

The choice of the kernel parameter value was not a crucial factor in distinguishing between the overall best error rates exhibited by the models, even though, for each kernel machine, some values of \(\sigma\) appear more frequently in Table 2, such as \(\sigma =8\) for SVM and \(\sigma =\{2,4\}\) for RVM. As depicted in Figs. 1, 2, 3, and 4, there is usually a range of values for the kernel parameter yielding quite interesting results, although no single value yields 100% correct classification for all classes. Interestingly, the best value of \(\sigma\) for the combination of mRVM1 and Katz's method was always the same, namely \(\sigma =32\), the highest value of the studied range. Higher values of this parameter might therefore yield even better results for mRVM1. In terms of stability, RVM models were usually more robust to the choice of \(\sigma\), considering the mean accuracy over all classes.

Finally, in what regards the complexity of the induced models, the number of support vectors and relevance vectors of the best calibrated SVM and RVM models was usually significantly higher than the number of RV associated with the mRVM models (refer to Table 2). An exception occurs for the combination of mRVM1 and Katz's method, in which case the number of RV was much higher than those obtained with the other methods for calculating the FD. The models induced by mRVM2, on the other hand, were always the least complex ones, regardless of the FD method. So, when the sparsity of the induced model is a key aspect to take into account, the use of mRVM2 is highly recommended.

5 Concluding remarks

In this paper, we investigated the potential of using relevance vector machines (both in the standard and multiclass formulations) to cope with the task of EMG signal classification. In this study, we considered different methods for calculating the fractal dimension of 1D signals as feature extractors.

Through experiments conducted on a publicly available dataset involving different types of limb movements (seven classes in total), we empirically confirmed that kernel machines equipped with FD feature values can achieve good levels of classification performance. In particular, the combination of SVM, RVM, and mRVM1 with Katz's method was the best across the different experimental trials in terms of accuracy and generalization. In what concerns model complexity, however, mRVM2 consistently produced sparser models, implying higher efficiency when classifying large batches of novel samples.

As ongoing work, we are currently extending the scope of investigation by considering other nonlinear dynamics methods to extract the hidden information in EMG signals, such as the Lyapunov exponent and the Hurst exponent [2]. As future work, we plan to investigate the impact of using EMG sub-segments of different sizes and of applying different feature selection methods, since feature selection is a preprocessing step that can bring gains in classifier accuracy [23, 45]. Finally, the combination of different kernel machines in heterogeneous committee machines will also be researched in the context of EMG signal classification.