1 Introduction

In recent years, the classification problem with imbalanced data has received considerable attention in areas such as Machine Learning and Pattern Recognition. A two-class data set is said to be imbalanced when one of the classes (the minority class) is heavily under-represented in comparison to the other class (the majority class) in the training dataset. In such situations, misclassifying activities from the minority class is costly, yet the learning system may have difficulty learning the concepts related to such activities, which results in suboptimal classifier performance.

This paper deals with the problem of imbalanced data in systems that assist sick or elderly people in performing daily life activities [1] such as cooking, brushing, dressing, and so on. Activity recognition datasets are generally imbalanced, meaning that certain activities occur more frequently than others. These differences may correspond to how often an activity is performed, e.g. leaving is generally done once a day while toileting is done several times a day, or to the number of time slices an activity takes up, e.g. a leaving activity generally takes up considerably more time slices than a toileting activity.

In recent years, there have been several attempts to deal with the class imbalance problem [2, 3]. Traditionally, research on this topic has focused on solutions at both the data and algorithmic levels. At the data level [4], solutions include different forms of re-sampling such as Over-Sampling (OS) and Under-Sampling (US). At the algorithmic level, solutions include adjusting the costs associated with misclassification so as to improve performance [5]. In [6], we proposed a new version of Weighted Support Vector Machines (WSVM) that sets a different cost parameter for each activity in order to handle imbalanced human activity datasets. In this paper, we propose a new classification model named OS-WSVM that combines the over-sampling method with the WSVM method to deal with the class imbalance problem. The experiments were carried out on multiple annotated real-world datasets of sensor readings from different houses [7, 8].

2 Proposed Approach

2.1 System Overview

The main idea of this paper is that the decision boundary of a dataset is entirely determined by its support vectors. Therefore, over-sampling (OS) is applied only to the support vectors obtained by WSVM learning. Through this process, the performance of the weighted SVM on imbalanced datasets can be enhanced. Moreover, the approach can reduce processing time because the number of vectors is bounded and becomes small. The new algorithm can be expressed as follows (Fig. 1):

Fig. 1. Block diagram of the proposed activity recognition approach.

Step 1: Use the Weighted SVM to deal with the imbalanced training datasets, and record the support vectors.

Step 2: Over-sample the support vectors to improve the degree of balance between the majority class and the minority class.

Step 3: Use the SVM to deal with the balanced datasets, and obtain the final classifier.

The trained SVM is then used to process each new observation during the testing phase, where the associated activity of daily living class is predicted.
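The three steps above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `os_wsvm`, `balance_by_duplication`, `train_wsvm` and `train_svm` are hypothetical names, the two trainers are assumed to be user-supplied callables that return a model together with the indices of its support vectors, and Step 2 is assumed to over-sample until the classes are fully balanced, which the text does not specify.

```python
import random
from collections import Counter


def balance_by_duplication(X, y, rng):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest one (assumed balancing target)."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def os_wsvm(X, y, train_wsvm, train_svm, seed=0):
    """Hypothetical orchestration of Steps 1-3.

    train_wsvm and train_svm are assumed callables (X, y) -> (model, sv_idx),
    where sv_idx lists the indices of the support vectors within X.
    """
    # Step 1: weighted SVM on the imbalanced data; keep only its support vectors.
    _, sv_idx = train_wsvm(X, y)
    X_sv = [X[i] for i in sv_idx]
    y_sv = [y[i] for i in sv_idx]
    # Step 2: over-sample the support vectors until the classes are balanced.
    X_bal, y_bal = balance_by_duplication(X_sv, y_sv, random.Random(seed))
    # Step 3: plain SVM on the balanced support-vector set.
    model, _ = train_svm(X_bal, y_bal)
    return model, X_bal, y_bal
```

Passing the trainers as arguments keeps the sketch independent of any particular SVM library; in practice they would wrap whatever solver is used.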

2.2 Over-Sampling (OS)

This approach increases the number of minority class samples. The simplest variant is random over-sampling, in which examples from the minority class are chosen at random. The chosen examples are duplicated and added to the original training set; since no original sample is discarded, no information is lost.
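A minimal sketch of random over-sampling follows. The function name `random_oversample` and its parameters are illustrative assumptions; how many duplicates to draw is left to the caller because the text does not fix it.

```python
import random


def random_oversample(X, y, minority_label, n_extra, seed=0):
    """Append n_extra randomly chosen duplicates of minority-class samples.

    The original samples are kept unchanged at the front of the result,
    so no information is lost.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(X) + extra, list(y) + [minority_label] * n_extra
```

Because samples are drawn with replacement, the same minority example may be duplicated several times.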

2.3 Support Vector Machines (SVM) [9]

For a two-class problem, we assume a training set \( \left\{ \left( \mathrm{x}_{\mathrm{i}}, \mathrm{y}_{\mathrm{i}} \right) \right\}_{\mathrm{i}=1}^{\mathrm{m}} \), where \( \mathrm{x}_{\mathrm{i}} \in \mathrm{R}^{\mathrm{n}} \) and the class labels \( \mathrm{y}_{\mathrm{i}} \) are either \( 1 \) or \( -1 \). The primal formulation of SVM maximizes the margin \( 2/\sqrt{\mathrm{K}(\mathrm{w},\mathrm{w})} \) and minimizes the training errors \( \xi_{\mathrm{i}} \) simultaneously by solving

$$ \begin{array}{*{20}l} {\mathop {\text{min} }\limits_{{{\text{w,b,}}\xi }} \quad 1/2.{\text{K}}\left( {\text{w,w}} \right) + {\text{C}}\sum\limits_{{{\text{i}} = 1}}^{\text{m}} {\xi_{\text{i}} } } \hfill \\ {{\text{subject}}\quad {\text{to}}\quad {\text{y}}_{\text{i}} \left( {{\text{w}}^{\text{T}} \phi \left( {{\text{x}}_{\text{i}} } \right) + {\text{b}}} \right) \ge 1 - \xi_{\text{i}} ,\xi_{\text{i}} \ge 0,\quad {\text{i = 1,}} \ldots ,{\text{m}}} \hfill \\ \end{array} $$
(1)

where w is the normal to the hyperplane, b is the offset of the hyperplane, C is the cost parameter that penalizes training errors, and \( \phi(\cdot) \) is a non-linear function which maps the input space into a feature space defined by the kernel \( \mathrm{K}(\mathrm{x}_{\mathrm{i}}, \mathrm{x}_{\mathrm{j}}) = \phi(\mathrm{x}_{\mathrm{i}})^{\mathrm{T}} \phi(\mathrm{x}_{\mathrm{j}}) \). Solving the dual formulation of Eq. (1) for the Lagrange multipliers \( \alpha \) gives a decision function for classifying a test point \( \mathrm{x} \in \mathrm{R}^{\mathrm{n}} \):

$$ {{\rm f}}({\rm{x}}) = {\rm sgn} \left( {\sum\limits_{{{{\rm i}} = 1}}^{{{{\rm m}}_{\rm{sv}} }} {\alpha_{{\rm i}} {{\rm y}}_{\rm{i}} {{\rm K}}({\rm{x}},{{\rm x}}_{\rm{i}} ) + {{\rm b}}} } \right) $$
(2)

where \( \mathrm{m}_{\mathrm{sv}} \) is the number of support vectors \( \mathrm{x}_{\mathrm{i}} \in \mathrm{R}^{\mathrm{n}} \).
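As an illustration, Eq. (2) can be evaluated directly once the support vectors, multipliers and offset are known. The sketch below assumes a Gaussian (RBF) kernel parameterized by a width sigma; this kernel choice and the names `rbf_kernel` and `decide` are assumptions made for the example.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decide(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate Eq. (2): the sign of sum_i alpha_i * y_i * K(x, x_i) + b."""
    score = sum(a * y * kernel(x, sv)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    # Convention chosen here: a score of exactly zero is assigned to class +1.
    return 1 if score >= 0 else -1
```

Only the support vectors enter the sum, which is why prediction cost grows with their number rather than with the size of the training set.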

2.4 Weighted Support Vector Machines (WSVM) [10]

WSVM was presented to deal with the imbalance problem by introducing two different cost parameters \( {\text{C}}_{ + } \) and \( {\text{C}}_{ - } \) in the SVM primal optimization problem [9] for the majority class (\( \mathrm{y}_{\mathrm{i}} = +1 \)) and the minority class (\( \mathrm{y}_{\mathrm{i}} = -1 \)), as given in Eq. (3) below:

$$ \begin{array}{*{20}l} {\mathop {\text{min} }\limits_{{{\text{w,b,}}\xi }} \quad 1/2.{\text{K}}\left( {\text{w,w}} \right) + {\text{C}}_{ + } \sum\limits_{{{\text{y}}_{\text{i}} = 1}} {\xi_{\text{i}} } + {\text{C}}_{ - } \sum\limits_{{{\text{y}}_{\text{i}} = - 1}} {\xi_{\text{i}} } } \hfill \\ {{\text{subject}}\quad {\text{to}}\quad {\text{y}}_{\text{i}} \left( {{\text{w}}^{\text{T}} \phi \left( {{\text{x}}_{\text{i}} } \right) + {\text{b}}} \right) \ge 1 - \xi_{\text{i}} ,\;\xi_{\text{i}} \ge 0,\quad {\text{i}} = 1, \ldots ,{\text{m}}} \hfill \\ \end{array} $$
(3)

\( {\text{C}}_{ + } \) and \( {\text{C}}_{ - } \) are cost parameters for positive and negative classes, respectively.
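To make the role of the two cost parameters concrete, the sketch below evaluates the objective of Eq. (3) for a given hyperplane. It assumes a linear kernel, so that K(w, w) is the dot product of w with itself and the feature map is the identity, and it uses the smallest slack that satisfies the constraints for each sample. The function name `wsvm_objective` is hypothetical.

```python
def wsvm_objective(w, b, X, y, c_pos, c_neg):
    """Value of the Eq. (3) objective for fixed (w, b), linear kernel assumed.

    For fixed (w, b), the smallest feasible slack of sample i is
    xi_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)
        # Errors on class +1 are charged C+, errors on class -1 are charged C-.
        penalty += (c_pos if yi == 1 else c_neg) * slack
    return regularizer + penalty
```

Raising `c_neg` relative to `c_pos` makes margin violations on the class labelled -1 more expensive, which pushes the optimizer towards hyperplanes that protect that class.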

Some authors [10, 11] have proposed adjusting the cost parameters to solve the imbalance problem. Veropoulos et al. [11] proposed increasing the cost of the minority class (i.e. \( {\text{C}}_{ - } > {\rm{C}}_{ + } \)) to obtain a larger margin on the side of the smaller class. In [6], we proposed a new criterion for choosing the cost parameters of the WSVM algorithm. The coefficients are adapted to each activity class and are typically chosen as:

$$ {\text{C}}_{\rm{i}} = {\rm{C}} \times \left[ {{\text{m}}_{ + } /{\text{m}}_{\rm{i}} } \right],\quad \quad {\text{i}} = 1, \ldots ,{\text{N}} $$
(4)

where \( {\text{m}}_{ + } \) is the number of samples of the majority class, \( \mathrm{m}_{\mathrm{i}} \) is the number of samples of class i, and N is the number of classes. C is the common misclassification cost factor of the WSVM; it is determined by cross-validation.
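Eq. (4) amounts to a few lines of counting. In the sketch below the square brackets of Eq. (4) are read as plain grouping (no rounding), which is an assumption, and `class_costs` is a hypothetical name.

```python
from collections import Counter


def class_costs(labels, C=1.0):
    """Per-class cost C_i = C * m_plus / m_i, following Eq. (4).

    m_plus is the size of the largest class and m_i the size of class i;
    the brackets in Eq. (4) are assumed to be plain grouping.
    """
    counts = Counter(labels)
    m_plus = max(counts.values())
    return {cls: C * m_plus / m_i for cls, m_i in counts.items()}
```

For example, with 6 'Sleeping', 2 'Toileting' and 1 'Drink' time slices and C = 1, the costs are 1, 3 and 6 respectively, so the rarest activity is penalized most heavily. Such per-class weights can be passed to LIBSVM through its `-wi` option.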

3 Simulation Results and Assessment

3.1 Datasets

We used fully labeled datasets [7, 8] gathered by a single occupant from three houses having different layouts and different non-intrusive sensor networks. Each network is composed of a different number of state-change sensor nodes, such as reed switches to determine the open-close states of doors and cupboards and pressure mats to identify sitting on a couch or lying in bed. The data were annotated using different methods. Time slices for which no annotation is available are collected in a separate activity labelled ‘Idle’. Table 1 shows the number of instances per activity in each dataset.

Table 1. Annotated list of activities and the number of instances of each one.

3.2 Results

In this study, the LIBSVM software package [12] was used to implement the multiclass SVM classifier. First, we optimized the hyper-parameters (σ, C) for all training sets, with σ in the range 0.1 to 2 and C in {0.1, 1, 10, 100}, to minimize the error rate under leave-one-day-out cross-validation. Then, locally, we optimized the cost parameter \( \mathrm{C}_{\mathrm{i}} \) adapted to each activity class by using the WSVM [6] classifier with the common cost parameter fixed at C = 1. The overall performance of our approach is compared with SVM, OS-SVM and WSVM and is summarized in Table 2. The results show that our approach outperforms the other methods.

Table 2. Recall, Precision, F-Measure and Accuracy results for all approaches.
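The hyper-parameter search described in this section could be organised roughly as follows. This is a sketch under assumptions: `leave_one_day_out`, `grid_search` and `cv_error` are hypothetical names, `cv_error` stands for a user-supplied callable that trains on the training indices and returns the error rate on the held-out day, and the step used for σ inside its range is not given in the text, so the grid is left to the caller.

```python
from itertools import product


def leave_one_day_out(days):
    """Yield (held_out_day, train_idx, test_idx) for each distinct day.

    days[i] is the day on which time slice i was recorded.
    """
    for held_out in sorted(set(days)):
        test_idx = [i for i, day in enumerate(days) if day == held_out]
        train_idx = [i for i, day in enumerate(days) if day != held_out]
        yield held_out, train_idx, test_idx


def grid_search(days, sigmas, costs, cv_error):
    """Return (mean_error, sigma, C) with the lowest mean held-out error.

    cv_error(sigma, C, train_idx, test_idx) is an assumed callable that
    trains a classifier and returns its error rate on test_idx.
    """
    best = None
    for sigma, C in product(sigmas, costs):
        errors = [cv_error(sigma, C, train_idx, test_idx)
                  for _, train_idx, test_idx in leave_one_day_out(days)]
        mean_error = sum(errors) / len(errors)
        if best is None or mean_error < best[0]:
            best = (mean_error, sigma, C)
    return best
```

Holding out whole days, rather than random time slices, keeps slices from the same day out of both the training and the test fold at once.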

Fig. 2 shows that, for the TK26M dataset, OS-WSVM outperforms the other approaches for the ‘Toileting’, ‘Showering’, ‘Breakfast’, ‘Dinner’ and ‘Drink’ activities and gives results similar to the other methods for the ‘Leaving’ and ‘Sleeping’ activities. The majority activities ‘Leaving’ and ‘Sleeping’ are better recognized by all methods, while the ‘Idle’ activity is recognized less accurately by the proposed method than by the other methods. Additionally, kitchen-related activities such as ‘Breakfast’, ‘Dinner’ and ‘Drink’ are in general harder to recognize than other activities.

Fig. 2. Recognition accuracy for each activity on the TK26M dataset.

In order to quantify the extent to which one class is harder to recognize than another, we analyzed the confusion matrix of OS-WSVM for the TK26M dataset in Table 3. We noticed that the activities ‘Leaving’, ‘Toileting’, ‘Showering’, ‘Sleeping’, ‘Dinner’ and ‘Drink’ are better recognized than ‘Idle’ and ‘Breakfast’.

Table 3. Confusion Matrix (values in %) for OS-WSVM for the TK26M dataset.
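A confusion matrix expressed in percent can be computed as sketched below. The layout is an assumption, since the text does not state it: rows are taken to be the true activities, columns the predicted ones, and each row is normalized to sum to 100, so that the diagonal gives the per-activity recognition rate. The name `confusion_matrix_percent` is hypothetical.

```python
def confusion_matrix_percent(y_true, y_pred, classes):
    """Row-normalized confusion matrix in percent.

    Entry [i][j] is the percentage of samples of true class classes[i]
    that were predicted as classes[j] (assumed layout).
    """
    index = {cls: i for i, cls in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(y_true, y_pred):
        counts[index[truth]][index[predicted]] += 1
    percent = []
    for row in counts:
        total = sum(row)
        # A class with no true samples yields an all-zero row.
        percent.append([100.0 * value / total if total else 0.0 for value in row])
    return percent
```

Reading along a row shows which activities a given true activity is most often mistaken for.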

The kitchen activities appear to be better recognized with the proposed method. In the TK26M house, there is a separate room for almost every activity. The kitchen activities are food-related tasks; they are the worst recognized because most instances of these activities were performed in the same location (the kitchen) using the same set of sensors. Therefore, the location of the sensors strongly influences recognition performance.

4 Conclusion

Our experiments on real-world datasets from smart home environments showed that the OS-WSVM strategy, which deals with class imbalance at both the data and algorithmic levels, can significantly increase recognition performance when classifying multiclass sensor data, and can improve the prediction of minority activities.

In the future, it will be interesting to use temporal features, such as the time at which an activity is performed, to improve activity classification performance. The scalability of our approach will also be tested further by considering datasets with more classes and varying numbers of sensors.