1 Introduction

In recent years, the elderly population has increased substantially around the world. Consequently, investigations into means of caring for the elderly have drawn wide attention, and different policies have been adopted. In some countries, providing for the elderly is the responsibility of the government, which in turn increases the financial pressure upon governments. In other countries, members of the younger generation look after their parents and elderly relatives, which places the younger generation under considerable strain. To enable the aged to live independently while keeping risks to a minimum, monitoring devices are being installed in homes. In this way, activities of daily living, such as sleeping, cooking, and eating, can be effectively recognized by analyzing the data generated by these monitoring devices.

Activity recognition involves predicting resident activity from generated data. Over the past decade, activity recognition has received increasing attention from numerous research groups using different kinds of monitoring devices, and the source data generated by such devices can be categorized into five types [1], three of which are generated by sensors. The first category of sensor data is generated by body-worn sensors [2]. Residents are required to wear monitoring devices, by means of which their activities are recorded in real time. Although this method protects resident privacy, the wearable devices can be considered an extra burden. The second category of sensor data is generated by pressure sensors [3] used to detect the position of residents seated on chairs, resting in bed, and performing sit-to-stand and stand-to-sit transitions; such monitoring technologies can, at present, detect only a few simple activities. In related work, Shen et al. [4] proposed an efficient multilayer authentication protocol along with a secure session-key generation method for wireless body area networks. Sun et al. [5] proposed a method based on an adaptive observation matrix to reduce errors and facilitate complete and accurate reconstruction of sensor response signals. Zhang et al. [6] proposed optimal cluster-based mechanisms based on a modified multi-hop layered model for load balancing via multiple mobile sinks. The third category of sensor data is generated by ambient sensors [7] placed in different rooms. Ambient sensors generally include light, temperature, and magnetic-door sensors. When residents move or perform an activity inside a room, the ambient sensors are activated, and the activity being performed is recognized on the basis of the resulting sensor events. Resident motion activates a sequence of ambient sensors; for example, "washing" activates ambient sensors installed on taps. In this regard, Zhang et al. [8] investigated how mobile sensors could be efficiently relocated to achieve k-barrier coverage. Such monitoring devices preserve resident privacy and free residents from wearing additional devices. The fourth category of data involves video data generated by cameras [9]. Yu et al. [10,11,12] applied multimodal technology to human pose recovery; as observed in their research, resident activities can be recognized using video analysis and processing techniques. Although camera-based approaches are criticized for breaching privacy, video-processing techniques have recently been introduced to anonymize footage and record only in situations wherein the user may be in danger. The last category of data refers to sonic data generated by residents or objects, examples of which include the sound of dishwashing or of a falling object or person [13]. However, a major limitation of using sonic data for activity recognition is that it is easily corrupted by stray noise.

The present study focuses on activity recognition using data produced by ambient sensors. A number of approaches have been proposed to improve the performance of activity-recognition systems. However, most of these approaches have attached greater importance to developing an excellent recognition algorithm than to adjusting the imbalanced distribution of activity classes. For instance, Wu et al. [14] proposed a mixed-kernel-based weighted extreme learning machine for inertial-sensor-based human-activity recognition with an imbalanced dataset. Abidine et al. [15] performed automatic recognition of activities by selecting a suitable regularization parameter C for the soft-margin support vector machine method, and Abidine et al. [16] also employed cost-sensitive support vector machines with adaptive tuning of the cost parameter to analyze imbalanced data. In the present study, numerous public datasets were investigated, and imbalanced distributions of activity classes were found to be a common occurrence. Additionally, it is demonstrated via experiments that an imbalanced distribution of activity classes tends to degrade the performance of activity-recognition systems. To counter this imbalance, this study proposes a sampling-based algorithm that improves the synthetic minority oversampling technique (SMOTE) to adjust the imbalanced distribution of activity classes. Two public datasets are used to evaluate the proposed approach, and experimental results demonstrate that the proposed algorithm remarkably improves the performance of activity-recognition systems.

The remainder of this paper is organized as follows. Relevant works are presented in Sect. 2; terminologies associated with activity recognition are defined in Sect. 3; the proposed improved SMOTE algorithm is presented in Sect. 4; the proposed algorithm is evaluated and the corresponding results are discussed in Sect. 5; lastly, the findings of this study are summarized and future work is outlined in Sect. 6.

2 Related Work

This section presents a brief overview of approaches previously proposed to address activity recognition and imbalanced-data adjustment.

2.1 Approaches for Activity Recognition

Activity-recognition approaches can, in general, be classified into data-driven and knowledge-driven approaches. Knowledge-driven approaches lay greater emphasis on the generation of recognition rules following a heuristic strategy. Such rules are usually represented in a logical language, such as temporal logic or description logic. Once the recognition rules have been generated, logical reasoning is performed to recognize individual activities. Rugnone et al. [17] utilized temporal logic to represent rules for recognizing abnormal activities. Yin et al. [18] and Chen et al. [19] represented recognition rules as ontologies, and Chen et al. [20] proposed an improved ontology-based approach. At the core of these approaches lies an iterative process that begins from so-called "seed" activity models, which are created via ontological engineering, deployed, and subsequently evolved via incremental activity discovery and model updates. Kong [21] proposed a decentralized belief-propagation-based method to facilitate multi-agent task allocation. Knowledge-driven approaches possess superior robustness, since recognition rules can be used in different environments. However, raw data commonly include substantial noise and uncertain information, which are difficult to identify and adversely affect the accuracy of activity recognition.

Data-driven approaches focus on the generation of classification models. Some of these approaches use time-series models, such as the hidden Markov model (HMM) or conditional random field (CRF), to recognize activities. Kasteren et al. [22, 23] employed the HMM and hierarchical HMM to realize resident-activity recognition. Tong et al. [24, 25] employed the latent-dynamic and hidden-state CRF models to facilitate recognition of abnormal activities as well as activities of single and multiple residents. A commonality of these approaches is that greater emphasis is laid on the respective orders of activities and sensor events. However, time-series models usually demonstrate poor robustness [26]. For instance, the daily orders of a resident's activities are seldom identical; for a given activity, the order of sensor events often changes; furthermore, the order of activities of a given resident is often different from that of another. To obtain greater robustness than time-series models offer, researchers have exploited static classifiers for activity recognition, such as the Naive Bayesian (NB) classifier, support vector machine (SVM), k-nearest neighbor (kNN), and random forest (RF) [27]. Cook et al. [28] employed NB to recognize daily activities. Yin et al. [29] employed a one-class SVM to recognize abnormal activities on a daily basis. Gu et al. [30] proposed an effective incremental support vector ordinal regression formulation based on a sum-of-margins strategy. Hevesi et al. [31] used the kNN classifier to recognize daily activities. Gu and Sheng [32] proposed a regularization-path algorithm for ν-support-vector-based classification. Xia et al. [33] proposed an approach wherein the kNN algorithm and locality-sensitive hashing are utilized to construct a secure and efficient index. Gu et al. [34] proposed a structural minimax probability machine for constructing a margin classifier.

2.2 Approaches for Handling Imbalanced Data

Datasets often possess unequal class distributions, a problem referred to as imbalanced classification. An imbalanced distribution of data renders classifiers prone to bias toward the majority class and accordingly leads to poor classification performance. To address this problem, a number of imbalance-adjustment strategies have been proposed. These strategies can be classified into sampling-based and algorithm-based types [35].

The algorithm-based strategy focuses on improving the learning algorithm and includes ensemble and cost-sensitive learning techniques. In this regard, Zhou et al. [36] proposed an ensemble-learning framework that incorporates cost-sensitive neural networks and classifiers for handling imbalanced classes. Li et al. [37] proposed a cost-sensitive and hybrid attribute measure, referred to as the multi-decision tree, to maximize classification performance whilst minimizing the total misclassification cost. Cheng et al. [38] designed a balanced classifier trained on imbalanced data based on the margin distribution theory.

Contrary to the above, the sampling-based strategy focuses on adjusting the imbalanced data itself and includes the random under-sampling (RUS), random over-sampling (ROS), and SMOTE algorithms [39]. RUS is an under-sampling technique that diminishes the majority class: a certain number of majority samples are randomly selected and deleted to reduce the imbalance within the dataset. This method, however, discards samples that may be useful for classification, which in turn could cause the loss of essential information. Zhang et al. [40] combined the inverse RUS and random tree techniques to implement imbalanced learning. ROS and SMOTE refer to two classic over-sampling methods, which involve expanding the minority class. The basic idea behind ROS is to randomly copy minority samples within a dataset, increasing their number so as to reduce the imbalance within the dataset. This method, however, simply duplicates minority samples, which may lead to over-fitting. Zhang et al. [41] proposed a random-walk over-sampling approach to balance different class samples by creating synthetic samples through random walks from the real data. The SMOTE algorithm likewise adds artificial samples to the minority class; however, SMOTE does not perform over-sampling via simple sample copying. Instead, it generates new minority samples beyond the original dataset, thereby avoiding over-fitting of classifiers to a certain extent. Sáez et al. [42] introduced an iterative ensemble-based noise filter into SMOTE, thereby enabling it to overcome problems related to noisy and borderline examples in imbalanced classification. Yu et al. [43, 44] integrated deep multimodal technology and SMOTE to facilitate image retrieval and ranking. Wang et al. [45] designed a back-propagation neural-network model using solar radiation as an input parameter to establish the relationship between solar radiation and air-temperature error whilst considering all data samples. Ma et al. [46] proposed an efficient detection algorithm based on structural clustering to convert the structural similarity between vertices into network weights.
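To make the contrast between these sampling strategies concrete, the following is a minimal sketch of the core SMOTE interpolation step (an illustrative re-implementation for exposition, not the code of [39]); the function name and parameters are chosen here purely for illustration.

import random
import numpy as np

def smote_interpolate(minority, k=5, n_new=1):
    # Generate synthetic minority samples by linear interpolation between
    # a randomly chosen minority sample and one of its k nearest minority neighbours.
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = random.randrange(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)      # distances to all minority samples
        neighbours = np.argsort(d)[1:k + 1]           # skip the sample itself
        x_nn = minority[random.choice(neighbours)]
        gap = random.random()                         # uniform in [0, 1)
        synthetic.append(x + gap * (x_nn - x))        # point on the segment x -> x_nn
    return np.array(synthetic)

By comparison, RUS would delete rows of the majority class and ROS would duplicate rows of the minority class, whereas the interpolation above creates new points on the segments joining minority samples.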

3 Terminologies

Prior to presenting the proposed approach, certain terminologies must be defined. For the sake of clarity, a stream segment of sensor events is presented in Table 1.

Table 1 A stream segment of sensor events

Definition 1

For a given sensor s, sr = (d, h, m, sn, sv, al) denotes a sensor event such that, when s is activated, d refers to the date of activation, h denotes the corresponding hour, and m represents the corresponding minute. Accordingly, sn denotes the name of s, sv denotes the value of s, and al denotes an explanatory activity label.

Throughout this manuscript, sr.d, sr.h, sr.m, sr.sn, sr.sv, and sr.al are used to represent the tuples d, h, m, sn, sv, and al, respectively, of a sensor event sr. The notation Ω is used to represent a set of sensor events.

For example, the expression “2011-06-15 00:25:01.892474 LS013 7 Sleep” implies that a sensor LS013 has been activated at 00:25:01.892474 on 2011-06-15 with a measured value of 7, and at the said time, the concerned resident was sleeping.
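For illustration, such a record can be mapped onto the tuple of Definition 1 with a few lines of Python; this is only a sketch, in which the field order (date, time, sensor name, value, activity label) is assumed from the example string above and the helper name is hypothetical.

from collections import namedtuple

SensorEvent = namedtuple("SensorEvent", "d h m sn sv al")

def parse_sensor_event(line):
    # Parse a record such as '2011-06-15 00:25:01.892474 LS013 7 Sleep'
    # into the tuple (d, h, m, sn, sv, al) of Definition 1.
    date, time, sn, sv, al = line.split()
    h, m, _seconds = time.split(":")
    return SensorEvent(d=date, h=int(h), m=int(m), sn=sn, sv=sv, al=al)

sr = parse_sensor_event("2011-06-15 00:25:01.892474 LS013 7 Sleep")
# sr.sn == 'LS013', sr.h == 0, sr.m == 25, sr.al == 'Sleep'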

Definition 2

Given two sensor events sr1 and sr2, sr1 is considered to be the precursor of sr2 if sr1.d < sr2.d, or (sr1.d == sr2.d AND sr1.h < sr2.h), or (sr1.d == sr2.d AND sr1.h == sr2.h AND sr1.m < sr2.m) holds. The event sr2 is considered to be the successor of sr1 if sr1 is the precursor of sr2.

Throughout this manuscript, the expression sr1 < sr2 indicates that sr1 is the precursor of sr2.

For example, the event {2011-06-15 00:25:01.892474 LS013 7 Sleep} is the precursor of {2011-06-15 01:05:01.622637 BATV013 9460 Sleep}. Similarly, the event {2011-06-15 01:05:01.622637 BATV013 9460 Sleep} is the successor of {2011-06-15 00:25:01.892474 LS013 7 Sleep}.
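Because Definition 2 orders events by date, then hour, then minute, the precursor test reduces to a lexicographic tuple comparison. A minimal sketch, building on the hypothetical parse_sensor_event helper above:

def is_precursor(sr1, sr2):
    # True if sr1 < sr2 in the sense of Definition 2
    # (compare date, then hour, then minute).
    return (sr1.d, sr1.h, sr1.m) < (sr2.d, sr2.h, sr2.m)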

Definition 3

Given two sensor events sr1 and sr2 such that sr1 < sr2 holds, sr1 is considered the direct precursor of sr2 if ¬∃ sr ∈ Ω such that sr1 < sr AND sr < sr2 holds. The event sr2 is said to be the direct successor of sr1 if the event sr1 is the direct precursor of sr2.

Throughout this manuscript, sr1 → sr2 indicates that the event sr1 is the direct precursor of sr2.

For example, the expression {2011-06-15 00:25:01.892474 LS013 7 Sleep} is the direct precursor of {2011-06-15 01:05:01.622637 BATV013 9460 Sleep}. Likewise, the event {2011-06-15 01:05:01.622637 BATV013 9460 Sleep} is the direct successor of {2011-06-15 00:25:01.892474 LS013 7 Sleep}.
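Similarly, the direct-precursor test of Definition 3 only needs to verify, in addition, that no event of Ω falls strictly between the two events; a sketch under the same assumptions as the previous helpers:

def is_direct_precursor(sr1, sr2, omega):
    # True if sr1 -> sr2 in the sense of Definition 3, i.e. sr1 precedes sr2
    # and no event in omega falls strictly between them.
    return (is_precursor(sr1, sr2)
            and not any(is_precursor(sr1, sr) and is_precursor(sr, sr2)
                        for sr in omega))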

Definition 4

Given an activity a and a sequence of sensor events sr0, sr1, sr2, …, srn, srn+1, the sequence SR = (sr1, sr2, …, srn) denotes a sensor sequence of a if ∀ 1 ≤ i ≤ n, sri.al == a, sr0.al ≠ a, srn+1.al ≠ a, and ∀ 1 ≤ i ≤ n − 1, sri → sri+1 holds.
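In other words, a sensor sequence of a is a maximal run of consecutive events labelled a in the time-ordered stream. The sketch below extracts all such runs, building on the SensorEvent tuples sketched earlier; the function name is illustrative.

def sensor_sequences(events, activity):
    # Split a time-ordered event list into the maximal runs of consecutive
    # events whose label equals `activity` (Definition 4).
    sequences, current = [], []
    for sr in events:
        if sr.al == activity:
            current.append(sr)
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences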

Definition 5

For an activity a and a sensor sequence sr1, sr2, …, srn of a, ar = (sr1.h, srn.h, u, SNT, a) refers to an activity record. The term u denotes the approximate duration of activity a; SNT denotes a spatial feature and is defined as the set {(sn, T)}, where sn ∈ {sri.sn | 1 ≤ i ≤ n} denotes the name of a sensor and T = |{sri | 1 ≤ i ≤ n AND sri.sn = sn}| denotes the number of times sensors named sn were activated. The terms u and SNT can be computed using Algorithm 1, described below.

Algorithm 1 Computation of u and SNT (figure)

For the sample sequence described in Table 1, "(0, 3, 212, {(MA021, 2), (BATV012, 1), (BATV013, 1), (LS013, 2)}, Sleep)" describes an activity record of "Sleep." The duration u is 212 min, the approximate time elapsed between the start and end of "Sleep." The ambient sensors "MA021" and "LS013" were each activated twice, whereas the sensors "BATV012" and "BATV013" were each activated once.
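Since Algorithm 1 itself is reproduced only as a figure, the sketch below gives one plausible reading of it, derived from Definition 5 and the example above; it is an illustrative reconstruction, not the authors' code, and it assumes that a sequence does not cross midnight.

from collections import Counter

def activity_record(seq, activity):
    # Build the activity record (start hour, end hour, u, SNT, a) of
    # Definition 5 from a sensor sequence `seq` of `activity`.
    start, end = seq[0], seq[-1]
    # approximate duration in minutes between the first and last events
    # (assumes the sequence does not span a date change)
    u = (end.h - start.h) * 60 + (end.m - start.m)
    # spatial feature: activation frequency per sensor name
    snt = dict(Counter(sr.sn for sr in seq))
    return (start.h, end.h, u, snt, activity)

Applied to the "Sleep" sequence of Table 1, it would produce a record of the form quoted above.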

4 Methodology

This section briefly describes the improved SMOTE (ISMOTE) algorithm used to balance the imbalanced distribution of activity classes. As can be observed in Fig. 1, SMOTE uses linear interpolation between two points to generate a new minority-class sample, thereby limiting the range of sample generation. To address this problem, the ISMOTE algorithm generates new synthetic minority activities in the neighborhood of existing minority-class samples. Two specified constraints are used to control the newly synthesized samples, thereby facilitating their generation in a robust manner. Compared to SMOTE, the ISMOTE algorithm improves the generalization ability of classifiers to a greater extent. Additionally, ISMOTE produces a more even and reasonable distribution of positive examples after balancing. Lastly, ISMOTE can also generate samples similar to the minority samples generated by the ROS and SMOTE algorithms.

Fig. 1 Diagrams of synthetic sample generation using (a) SMOTE and (b) ISMOTE algorithms with k = 3

The proposed ISMOTE approach is described by Algorithms 2 and 3 as well as by its schematic flowchart (Fig. 2). In Algorithm 2, lines 1 and 2 describe parameter initialization, whereas lines 3–8 compute the number of activity records in each activity class. Line 10 then calculates the degree of imbalance of each class (Im_D), and the Euclidean distance is used to determine the k nearest neighboring activity records (lines 11–14). Subsequently, Algorithm 3 is used to generate synthetic minority-class activity records: line 1 of Algorithm 3 randomly selects Im_D activity records from the k nearest neighbors, and lines 2–9 generate synthetic minority-class activity records in the high-dimensional feature space. If a newly generated synthetic minority-class activity record does not meet the specified constraints, the ISMOTE algorithm regenerates it (lines 10–12 of Algorithm 3), and the above process is repeated. Ultimately, a balanced set of activity records is generated through use of the ISMOTE approach.

Algorithm 2 (figure)
Algorithm 3 (figure)
Fig. 2 Flowchart of the proposed ISMOTE approach
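Because Algorithms 2 and 3 are reproduced only as figures, the following is a rough sketch of the generation loop as described above: compute the per-class imbalance degree, find the k nearest minority neighbors of a selected record, and synthesize new records in the neighborhood of the minority samples, regenerating any candidate that violates the constraints. The reading of Im_D as the gap to the majority-class size and the neighborhood-sampling rule are assumptions, and the constraint check is left abstract because its exact form is not spelled out in the text.

import numpy as np

def ismote(X, y, k=5, constraints=lambda x: True, seed=0):
    # Sketch of the ISMOTE balancing loop of Sect. 4.
    # X: (n, d) activity-record features; y: activity-class labels.
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    new_X, new_y = [], []
    for cls, count in zip(classes, counts):
        im_d = counts.max() - count            # number of records to synthesize (assumed)
        X_min = X[y == cls]
        if im_d == 0 or len(X_min) < 2:
            continue
        for _ in range(im_d):
            x = X_min[rng.integers(len(X_min))]
            nn = X_min[np.argsort(np.linalg.norm(X_min - x, axis=1))[1:k + 1]]
            while True:
                # sample in the neighborhood of x rather than only on the
                # segment toward a neighbor, as in plain SMOTE
                x_new = x + rng.uniform(-1.0, 1.0, x.shape) * (nn[rng.integers(len(nn))] - x)
                if constraints(x_new):         # the two constraints of Sect. 4, left abstract
                    break
            new_X.append(x_new)
            new_y.append(cls)
    if not new_X:
        return X, y
    return np.vstack([X, np.array(new_X)]), np.concatenate([y, np.array(new_y)])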

5 Results and Evaluation

5.1 Datasets

To validate the proposed algorithm, two public datasets, "HH102" and "HH104," published by Washington State University [47], were considered. Statistical information concerning the two datasets is presented in Table 2. Values listed under the "Sensors" column correspond to the number of sensors involved and their categories; values under "Activities" correspond to the number of activity classes; values under "Activity Records" correspond to the number of activity records; and values under "Measurement Time" correspond to the durations over which the data were collected.

Table 2 Statistical information concerning datasets “HH102” and “HH104.”

For the "HH102" dataset, the following identifier categories were considered.

(1) Identifiers starting with "BA" indicate sensor battery levels; for example, BATP013, BATP019, BATV001–BATV023, and BATV102–BATV105.

(2) Identifiers starting with "D" indicate magnetic door sensors: D001, D002, D005, and D006.

(3) Identifiers starting with "L" and "LL" indicate light switches: L001–L005, LL001, and LL005.

(4) Identifiers starting with "LS" indicate light sensors: LS001–LS023.

(5) Identifiers starting with "M" indicate infrared motion sensors: M001–M022.

(6) Identifiers starting with "MA" indicate wide-area infrared motion sensors: MA003, MA009, MA010, MA013, MA014, MA020, and MA023.

(7) Identifiers starting with "T" indicate temperature sensors: T101–T105.

Involved activities include "Sleep" ("S"), "Bathe" ("B"), "Dress" ("D"), "Eat_Breakfast" ("E_B"), "Eat_Dinner" ("E_D"), "Groom" ("G"), "Take_Medicine" ("T_M"), "Toilet" ("T"), "Wash_Breakfast_Dishes" ("W_B_D"), "Wash_Dinner_Dishes" ("W_D_D"), "Watch_TV" ("W_T"), and "Work_At_Table" ("W_A_T"). The number of samples in each of these activity classes and the corresponding degrees of imbalance are listed in Table 3.

Table 3 Activity-class distribution for “HH102” dataset

Similarly, for the "HH104" dataset, the following identifier categories were considered.

(1) Identifiers starting with "BA" indicate sensor battery levels; for example, BATP001–BATP006, BATP101–BATP106, BATV001–BATV026, and BATV101–BATV106.

(2) Identifiers starting with "D" indicate magnetic door sensors: D001–D006.

(3) Identifiers starting with "L" and "LL" indicate light switches: L001–L006.

(4) Identifiers starting with "LS" indicate light sensors: LS001–LS026.

(5) Identifiers starting with "M" indicate infrared motion sensors: M001–M013, M016, and M020–M026.

(6) Identifiers starting with "MA" indicate wide-area infrared motion sensors: MA014, MA015, MA017–MA019, and MA022.

(7) Identifiers starting with "T" indicate temperature sensors: T101–T107.

Involved activities include "Sleep_Out_Of_Bed" ("S_O_O_B"), "Evening_Meds" ("E_M"), "Dress" ("D"), "Cook_Breakfast" ("C_B"), "Cook_Dinner" ("C_D"), "Phone" ("P"), "Take_Medicine" ("T_M"), "Toilet" ("T"), "Wash_Breakfast_Dishes" ("W_B_D"), "Wash_Dinner_Dishes" ("W_D_D"), "Morning_Meds" ("M_M"), and "Work_On_Computer" ("W_O_C"). The number of samples in each of these activity classes and the corresponding degrees of imbalance are listed in Table 4.

Table 4 Activity-class distribution for “HH104” dataset

5.2 Results and Evaluation Metrics

In this study, the ISMOTE algorithm was compared against the SMOTE algorithm and "Primary" through use of four classifiers: NB, SVM, C4.5, and RF. The term "Primary" implies that individual activities are recognized by a classifier without any algorithm for imbalanced-data adjustment. For the SMOTE and ISMOTE algorithms, k was assigned a value of 5. The toolset employed was Weka 3.9, and 3-fold cross-validation was performed. The evaluation metrics considered were accuracy, precision, and F-measure.
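The experiments themselves were run in Weka 3.9. Purely for illustration, an equivalent evaluation protocol can be sketched in Python with scikit-learn, with DecisionTreeClassifier standing in for C4.5 (Weka's J48); this is an assumed substitute, not the setup actually used in the study.

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

classifiers = {
    "NB":   GaussianNB(),
    "SVM":  SVC(),
    "C4.5": DecisionTreeClassifier(),   # CART here; C4.5 itself corresponds to Weka's J48
    "RF":   RandomForestClassifier(),
}
scoring = ["accuracy", "precision_macro", "f1_macro"]

def evaluate(X, y):
    # 3-fold cross-validation reporting accuracy, precision, and F-measure
    results = {}
    for name, clf in classifiers.items():
        cv = cross_validate(clf, X, y, cv=3, scoring=scoring)
        results[name] = {m: cv["test_" + m].mean() for m in scoring}
    return results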

Average recognition accuracies for the HH102 and HH104 datasets are depicted in Figs. 3 and 4, respectively. The average accuracy achieved by "Primary" was observed to be almost equal to that achieved by SMOTE, whereas the average accuracy of ISMOTE exceeded those of both SMOTE and Primary when employing SVM, C4.5, and RF. In contrast, when using NB as the classifier, the accuracy of ISMOTE was lower than that of SMOTE and Primary. The two best accuracy values were achieved by applying the ISMOTE algorithm with RF as the classifier.

Fig. 3 Average accuracies for the HH102 dataset using SVM, NB, C4.5, and RF as classifiers

Fig. 4 Average accuracies for the HH104 dataset using SVM, NB, C4.5, and RF as classifiers

Trends in recognition accuracy for individual activities in the HH102 dataset are depicted in Figs. 5, 6, 7 and 8; corresponding trends for the HH104 dataset are depicted in Figs. 9, 10, 11 and 12. For the HH102 dataset, the ISMOTE algorithm achieved higher accuracies for 11, 9, 10, and 9 activities, respectively, compared with SMOTE and Primary when SVM, NB, C4.5, and RF were used as classifiers. Correspondingly, for the HH104 dataset, ISMOTE achieved higher accuracies for 10, 8, 11, and 11 activities, respectively, in comparison to SMOTE and Primary when employing SVM, NB, C4.5, and RF as classifiers.

Fig. 5 Recognition-accuracy trends for the HH102 dataset using SVM

Fig. 6 Recognition-accuracy trends for the HH102 dataset using NB

Fig. 7 Recognition-accuracy trends for the HH102 dataset using C4.5

Fig. 8 Recognition-accuracy trends for the HH102 dataset using RF

Fig. 9 Recognition-accuracy trends for the HH104 dataset using SVM

Fig. 10 Recognition-accuracy trends for the HH104 dataset using NB

Fig. 11 Recognition-accuracy trends for the HH104 dataset using C4.5

Fig. 12 Recognition-accuracy trends for the HH104 dataset using RF

As depicted in Figs. 3 and 4, the average accuracy achieved by “Primary” equaled 74% and 75%, respectively, both of which correspond to the lowest accuracy of activity recognition achieved when employing the four classifiers (SVM, NB, C4.5, RF). Additionally, NB is not suitable for use as a classification algorithm for activity recognition in conjunction with the ISMOTE approach for adjustment of imbalanced data. Conversely, RF can be observed to be the most suitable classification algorithm for activity recognition when used in conjunction with the ISMOTE approach.

In addition, as depicted in Fig. 6, for the HH102 dataset with ISMOTE used in conjunction with the NB classifier, 8 out of 12 activities were recognized more accurately than by Primary and SMOTE. Similarly, for the HH104 dataset (Fig. 10), with ISMOTE used in conjunction with the NB classifier, 9 out of 12 activities were recognized more accurately than by Primary and SMOTE. Additionally, for the HH104 dataset, the recognition accuracies of activities with fewer occurrences, such as "Sleep_Out_Of_Bed," "Phone," "Take_Medicine," and "Wash_Dinner_Dishes," were all observed to be higher than those achieved by Primary and SMOTE. This demonstrates the greater ability of the proposed ISMOTE algorithm to accurately recognize infrequent activities.

Average recognition precisions of the Primary, SMOTE, and ISMOTE algorithms when applied to the HH102 and HH104 datasets are depicted in Figs. 13 and 14, respectively. As can be observed, the average precision achieved by Primary nearly equals that achieved by SMOTE. For the HH102 dataset, the average precision achieved by ISMOTE generally exceeds that achieved by SMOTE and Primary when employing SVM, C4.5, and RF as classifiers, whereas it nearly equals the average precision of SMOTE and Primary when employing the NB classifier. For the HH104 dataset, the average precision achieved by ISMOTE exceeds those of SMOTE and Primary regardless of the classifier used. For both datasets, the best precision was achieved by applying the ISMOTE algorithm with the RF classifier.

Fig. 13 Average precision values for the HH102 dataset using SVM, NB, C4.5, and RF as classifiers

Fig. 14 Average precision values for the HH104 dataset using SVM, NB, C4.5, and RF as classifiers

Trends in the precision of individual-activity recognition for the HH102 dataset are depicted in Figs. 15, 16, 17 and 18; corresponding trends for the HH104 dataset are depicted in Figs. 19, 20, 21 and 22. For the HH102 dataset, ISMOTE achieves higher precision for 7, 5, 8, and 8 activities, respectively, in comparison to SMOTE and Primary when employing SVM, NB, C4.5, and RF as classifiers. Correspondingly, for the HH104 dataset, ISMOTE achieves higher precision for 9, 6, 12, and 11 activities, respectively, in comparison to SMOTE and Primary when employing SVM, NB, C4.5, and RF as classifiers.

Fig. 15 Precision of activity recognition in the HH102 dataset using SVM

Fig. 16 Precision of activity recognition in the HH102 dataset using NB

Fig. 17 Precision of activity recognition in the HH102 dataset using C4.5

Fig. 18 Precision of activity recognition in the HH102 dataset using RF

Fig. 19 Precision of activity recognition in the HH104 dataset using SVM

Fig. 20 Precision of activity recognition in the HH104 dataset using NB

Fig. 21 Precision of activity recognition in the HH104 dataset using C4.5

Fig. 22 Precision of activity recognition in the HH104 dataset using RF

F-measure values obtained when employing the Primary, SMOTE, and ISMOTE algorithms on the HH102 and HH104 datasets are compared in Figs. 23 and 24, respectively. In this case, the performances of the Primary and SMOTE algorithms can be observed to be nearly identical. For both datasets, the F-measure values achieved by ISMOTE exceed those achieved by SMOTE and Primary when employing the SVM, C4.5, and RF classifiers. However, the F-measure values of ISMOTE when employing NB as the classifier are lower than those of SMOTE and Primary for both datasets. In both cases, the best F-measure values are achieved when employing the ISMOTE algorithm in conjunction with RF.

Fig. 23 F-measure values for the HH102 dataset using SVM, NB, C4.5, and RF as classifiers

Fig. 24 F-measure values for the HH104 dataset using SVM, NB, C4.5, and RF as classifiers

In the proposed approach, the parameter k controls the scope of the generated synthetic activities. During experimentation, k was set to 3, 5, 7, and 9 to examine the effect of its value on the results. As observed from the experimental results, use of RF as the classifier yielded the highest average values of accuracy, precision, and F-measure for both datasets, demonstrating RF to be the most suitable classifier for activity recognition in conjunction with the proposed ISMOTE algorithm. The RF classifier was therefore used to compare results obtained with different k values. Comparisons of the average percentage accuracy and F-measure values obtained for the HH102 and HH104 datasets when employing different values of k are depicted in Figs. 25 and 26, respectively. For both datasets, the highest average accuracy and F-measure values using RF were observed at k = 5; in particular, the F-measure at k = 5 exceeded that at k = 7. Thus, k = 5 was used in the experiments.

Fig. 25 Performance of the RF classifier applied to the HH102 dataset with different k values

Fig. 26 Performance of the RF classifier applied to the HH104 dataset with different k values

There exist two main classes of imbalance-learning strategies: sampling methods and cost-sensitive techniques. In this study, four well-known strategies were investigated. Three of these qualify as sampling methods, namely, random under-sampling (RUS), random over-sampling (ROS), and the synthetic minority over-sampling technique (SMOTE); the fourth qualifies as a cost-sensitive technique, the cost-matrix adjuster (CMA), wherein the cost matrix is adjusted. These four strategies were specifically considered because they are popular and diverse. RUS is an under-sampling technique, wherein the size of the majority class is reduced. ROS and SMOTE are two classic over-sampling methods, wherein the minority class is expanded; the difference between them is that ROS adds duplicated samples to the minority class, whereas SMOTE creates artificial samples to be added to the minority class.
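For reference, the three sampling baselines are available off-the-shelf in the imbalanced-learn Python package; the sketch below shows how the resampled training sets could be produced. This is an illustrative alternative to the implementations actually used in the experiments, and the cost-matrix adjuster, being a classifier-level technique, is omitted.

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

samplers = {
    "RUS":   RandomUnderSampler(random_state=0),
    "ROS":   RandomOverSampler(random_state=0),
    "SMOTE": SMOTE(k_neighbors=5, random_state=0),
}

def resample_all(X_train, y_train):
    # Return one rebalanced copy of the training set per sampling strategy.
    return {name: s.fit_resample(X_train, y_train) for name, s in samplers.items()}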

As suggested by the results of this study, use of the RF classifier yields the highest average values of accuracy, precision, and F-measure for both datasets considered; RF can, therefore, be considered the most suitable classifier type for activity recognition. Thus, RF was also employed to compare activity-recognition results obtained using the different imbalance-learning strategies: RUS, ROS, SMOTE, and CMA. Comparisons of the average percentage accuracy and F-measure values obtained for the HH102 and HH104 datasets when employing different imbalance-learning strategies are depicted in Figs. 27 and 28, respectively. As observed, for both datasets, the highest average accuracy and F-measure values using RF corresponded to use of the ISMOTE algorithm. For the HH102 dataset, the accuracy of ISMOTE using RF exceeded those of Primary, CMA, RUS, ROS, and SMOTE by 7.40%, 4.65%, 21.62%, 3.45%, and 8.91%, respectively. With regard to the HH104 dataset, the corresponding values equaled 7.95%, 5.56%, 93.88%, 3.26%, and 7.95%, respectively.

Fig. 27 Performance of the RF classifier applied to the HH102 dataset when using different imbalance-learning strategies

Fig. 28 Performance of the RF classifier applied to the HH104 dataset when using different imbalance-learning strategies

5.3 Discussion

In accordance with the results obtained in this study, the following points must be noted.

(1) Use of the ISMOTE algorithm with RF demonstrated attainment of the highest average values of accuracy, precision, and F-measure. The highest average accuracy equaled 90% and 95% for the HH102 and HH104 datasets, respectively. Correspondingly, the highest average precision equaled 90% for HH102 and 96% for HH104, and the highest average F-measure equaled 90% for HH102 and 95% for HH104.

(2) Use of the RF classifier demonstrated attainment of the highest average values of accuracy, precision, and F-measure for both datasets. RF can, therefore, be considered the best classifier type to be used for activity recognition.

(3) In dataset HH102, there exist 8 activity classes with a degree of imbalance greater than one. Table 5 lists the improvements in accuracy and precision achieved for these activity classes through use of RF. For a given activity class a, "I-P" denotes the difference between the accuracy (precision) of a obtained using ISMOTE and that obtained using Primary; likewise, "I-S" denotes the difference between the accuracy (precision) of a obtained using ISMOTE and that obtained using SMOTE. The accuracy and precision values of 6 activity classes were remarkably improved. Correspondingly, dataset HH104 contains 10 such activity classes; Table 6 lists the improvements in accuracy and precision achieved for these classes through use of RF. The accuracies of 11 activity classes and the precisions of 10 activity classes were remarkably improved. These results demonstrate ISMOTE to be a promising algorithm for use in activity-recognition applications.

    Table 5 Accuracy and precision improvements achieved by applying RF to HH102 dataset
    Table 6 Accuracy and precision improvements achieved by applying RF to HH104 dataset

6 Conclusions

This paper presents the ISMOTE algorithm as a means of adjusting imbalanced activity classes in activity-recognition applications. The proposed algorithm was evaluated using four classifiers on two public datasets, and the results demonstrate the ability of the ISMOTE algorithm to dramatically improve the performance of activity-recognition systems.