1 Introduction

Automatic monitoring of human physical activities has attracted great interest in recent years because it provides contextual and behavioral information about a user without explicit user feedback. Being able to detect human activities automatically and in a continuous, unobtrusive manner is of special interest for applications in sports [16], recommendation systems, and elderly care, to name a few. For example, appropriate music playlists can be recommended based on the user’s current activity (exercising, working, studying, etc.) [21]. Elderly people at an early stage of dementia could also benefit from these systems, for example, by monitoring their hygiene-related activities (showering, washing hands, or brushing teeth) and sending reminder messages when appropriate [19]. Human activity recognition (HAR) also has potential for mental health care applications [11], since it can be used to detect sedentary behaviors [4], and it has been shown that there is an important association between depression and sedentarism [5]. Recently, the use of wearable sensors has become the most common approach to recognizing physical activities because of their unobtrusiveness and ubiquity. Accelerometers in particular are widely used [9, 15, 17] because they are already embedded in many commonly used devices such as smartphones, smart-watches, and fitness bracelets.

In this paper, we present HTAD: a Home Tasks Activities Dataset. The dataset was collected using a wrist accelerometer and audio recordings. It contains data for common home-task activities such as sweeping, brushing teeth, watching TV, and washing hands. To protect users’ privacy, we include audio data only after feature extraction. For accelerometer data, we include both the raw data and the extracted features.

There are already several related datasets in the literature. For example, the epic-kitchens dataset includes several hours of first-person videos of activities performed in kitchens [6]. Another dataset, presented by Bruno et al., has 14 activities of daily living collected with a wrist-worn accelerometer [3]. Despite the many existing activity datasets, it is still difficult to find one that includes both wrist acceleration and audio. The authors in [20] developed an application capable of collecting and labeling data from smartphones and wrist-watches. Their app can collect data from several sensors, including inertial sensors and audio. The authors released a dataset that includes 2 participants and point to another website (http://extrasensory.ucsd.edu) that contains data from 60 participants. However, the link to that website was not working as of August 10, 2020. Even though the present dataset was collected from only 3 volunteers, and is thus small compared to others, we think it is useful for the activity recognition community and other researchers interested in wearable sensor data processing. The dataset can be used for machine learning classification problems, especially those that involve the fusion of different modalities such as sensor and audio data. It can be used to test data fusion methods [13] and as a starting point towards detecting more types of activities in home settings. Furthermore, the dataset can potentially be combined with other public datasets to test the effect of using heterogeneous types of devices and sensors.

This paper is organized as follows: In Sect. 2, we describe the data collection process. Section 3 details the feature extraction process for both accelerometer and audio data. In Sect. 4, the structure of the dataset is explained. Section 5 presents baseline experiments with the dataset, and finally, in Sect. 6, we present the conclusions.

2 Dataset Details

The dataset can be downloaded via: https://osf.io/4dnh8/.

The home-task data were collected from 3 individuals: 1 female and 2 males, with ages ranging from 25 to 30. The subjects were asked to perform 7 scripted home-task activities: mop floor, sweep floor, type on computer keyboard, brush teeth, wash hands, eat chips, and watch TV. The eat chips activity was conducted with a bag of chips. Each individual performed each activity for approximately 3 min. If an activity lasted less than 3 min, additional trials were conducted until the 3 min were completed. The volunteers used a wrist-band (Microsoft Band 2) and a smartphone (Sony XPERIA) to collect the data.

The subjects wore the wrist-band on their dominant hand. The accelerometer data were collected using the wrist-band’s internal accelerometer. Figure 1 shows the actual device used. The inertial sensor captures motion along the x, y, and z axes, and the sampling rate was set to 31 Hz. The environmental sound was captured using the microphone of a smartphone, with an audio sampling rate of 8000 Hz. The smartphone was placed on a table in the same room where the activity was taking place.

An in-house app, developed for the Android operating system, was used to collect the data. The user interface consists of a dropdown list from which the subject selects the home task. The wrist-band transfers the captured sensor data and timestamps to the smartphone over Bluetooth. All inertial data are stored in plain text format.

Fig. 1. Wrist-band watch.

3 Feature Extraction

To extract the accelerometer and audio features, the original raw signals were divided into non-overlapping 3 s segments. A three second window was chosen because, according to Banos et al. [2], this is a typical value for activity recognition systems. They performed comprehensive tests with different segment sizes and concluded that small segments produce better results than longer ones. From each segment, a set of features was computed; these sets are known as feature vectors or instances. Each instance is characterized by its audio and accelerometer features. The sketch below illustrates the segmentation step, and the following sections provide details about how the features were extracted.
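As an illustration, the following Python/NumPy sketch shows one way to obtain such non-overlapping 3 s windows. The `segment` helper and its naming are hypothetical; the dataset ships only the raw files and the pre-extracted features, not our extraction scripts.

```python
import numpy as np

def segment(signal, sampling_rate, window_seconds=3):
    """Split a (n_samples, n_channels) signal into non-overlapping windows."""
    window_size = int(window_seconds * sampling_rate)
    n_windows = len(signal) // window_size  # the incomplete tail is discarded
    return [signal[i * window_size:(i + 1) * window_size]
            for i in range(n_windows)]

# Example: accelerometer sampled at 31 Hz -> 93 samples per 3 s window;
# audio sampled at 8000 Hz -> 24000 samples per 3 s window.
```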

3.1 Accelerometer Features

From the inertial sensor readings, 16 measurements were computed: the mean, standard deviation, and maximum value for each of the x, y, and z axes; the Pearson correlation between pairs of axes (xy, xz, and yz); the mean magnitude; the standard deviation of the magnitude; the magnitude area under the curve (AUC, Eq. 1); and the mean difference of the magnitude between consecutive readings (Eq. 2). The magnitude of the signal (Eq. 3) characterizes the overall contribution of the acceleration along x, y, and z. These features were selected based on previous related works [7, 10, 23].

$$\begin{aligned} AUC = \sum \limits _{t = 1}^T {magnitude(t)} \end{aligned}$$
(1)
$$\begin{aligned} meandif = \frac{1}{{T - 1}}\sum \limits _{t = 2}^T {magnitude(t) - magnitude(t - 1)} \end{aligned}$$
(2)
$$\begin{aligned} Magnitude(x,y,z,t) = \sqrt{{a_x}{{(t)}^2} + {a_y}{{(t)}^2} + {a_z}{{(t)}^2}} \end{aligned}$$
(3)

where \(a_x(t)\), \(a_y(t)\), and \(a_z(t)\) are the accelerations along the x, y, and z axes at time \(t\).
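For concreteness, a sketch of these 16 features in Python/NumPy is given below. The feature names used here are illustrative and may not match the column names in the released features.csv file described in Sect. 4.

```python
import numpy as np

def accel_features(window):
    """Compute the 16 accelerometer features for one 3 s window of shape (n_samples, 3)."""
    x, y, z = window[:, 0], window[:, 1], window[:, 2]
    magnitude = np.sqrt(x**2 + y**2 + z**2)            # Eq. (3)
    feats = {}
    for name, axis in (("x", x), ("y", y), ("z", z)):
        feats[f"mean_{name}"] = axis.mean()
        feats[f"sd_{name}"] = axis.std()
        feats[f"max_{name}"] = axis.max()
    feats["corr_xy"] = np.corrcoef(x, y)[0, 1]         # Pearson correlations
    feats["corr_xz"] = np.corrcoef(x, z)[0, 1]
    feats["corr_yz"] = np.corrcoef(y, z)[0, 1]
    feats["mean_magnitude"] = magnitude.mean()
    feats["sd_magnitude"] = magnitude.std()
    feats["auc_magnitude"] = magnitude.sum()           # Eq. (1)
    feats["mean_dif_magnitude"] = np.diff(magnitude).mean()  # Eq. (2)
    return feats
```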

Figure 2 shows violin plots for three of the accelerometer features: mean of the x-axis, mean of the y-axis, and mean of the z-axis. Here, we can see that overall, the mean acceleration in x was higher for the brush teeth and eat chips activities. On the other hand, the mean acceleration in the y-axis was higher for the mop floor and sweep activities.

Fig. 2. Violin plots of mean acceleration of the x, y, and z axes.

3.2 Audio Features

The features extracted from the sound source were Mel Frequency Cepstral Coefficients (MFCCs). These features have been shown to be suitable for activity classification tasks [1, 8, 12, 18]. The 3 s sound signals were further split into 1 s windows, and 12 MFCCs were extracted from each 1 s window, so each instance has 36 MFCCs. This process resulted in 1,386 instances in total. The tuneR R package [14] was used to extract the audio features; a sketch of this step is shown below. Table 1 shows the percentage of instances per class; the classes are approximately balanced.
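The following Python sketch approximates the audio feature extraction using librosa rather than tuneR, which was the package actually used. Averaging the per-frame coefficients within each second to obtain 12 values per window is our assumption, so the resulting values will not match features.csv exactly.

```python
import numpy as np
import librosa

def audio_features(audio_3s, sr=8000, n_mfcc=12):
    """Return 36 MFCC-based features (12 per second) for a 3 s audio segment."""
    coeffs = []
    for sec in range(3):
        chunk = audio_3s[sec * sr:(sec + 1) * sr]
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc)  # shape (12, n_frames)
        coeffs.append(mfcc.mean(axis=1))  # collapse frames within the second
    return np.concatenate(coeffs)         # shape (36,)
```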

Table 1. Distribution of activities by class.

4 Dataset Structure

The main folder contains one directory per user and a features.csv file. Within each user’s directory, the accelerometer files (.txt files) can be found. The file names consist of three parts with the following format: timestamp-acc-label.txt, where timestamp is a Unix timestamp, acc stands for accelerometer, and label is the activity’s label. Each .txt file has four columns: the timestamp and the acceleration for each of the x, y, and z axes. Figure 3 shows an example of the first rows of one of the files. The features.csv file contains the extracted features as described in Sect. 3. It has 54 columns: userid is the user id, label is the activity label, and the remaining columns are the features. Columns with the prefix v1_ correspond to audio features, whereas columns with the prefix v2_ correspond to accelerometer features. In total, there are 36 audio features (12 MFCCs for each of the 3 one-second windows) and 16 accelerometer features. A short loading example is given below.
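As a quick-start example, the snippet below loads the dataset with pandas. The specific file name is a hypothetical instance of the naming scheme above, and we assume the .txt files are comma-separated; adjust the separator if needed.

```python
import pandas as pd

# Pre-extracted features: userid, label, v1_* (audio), v2_* (accelerometer).
features = pd.read_csv("features.csv")
audio_cols = [c for c in features.columns if c.startswith("v1_")]
accel_cols = [c for c in features.columns if c.startswith("v2_")]

# One raw accelerometer file (path and name are made-up examples of the scheme).
acc = pd.read_csv("user1/1496432422000-acc-brush_teeth.txt",
                  names=["timestamp", "x", "y", "z"])
```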

Fig. 3. First rows of one of the accelerometer files.

5 Baseline Experiments

In this section, we present a series of baseline experiments that can serve as a starting point for developing more advanced methods and sensor fusion techniques. In total, 3 classification experiments were conducted with the HTAD dataset. In each experiment, several classifiers were employed: ZeroR (baseline), a J48 tree, Naive Bayes, a Support Vector Machine (SVM), a K-nearest neighbors (KNN) classifier with \(k=3\), logistic regression, and a multilayer perceptron. We used the WEKA software [22] version 3.8 to train the classifiers. Each experiment used a different set of features. In experiment 1, we trained the models using only the audio features, that is, the MFCCs. Experiment 2 consisted of training the models with only the 16 accelerometer features described earlier. Finally, in experiment 3, we combined the audio and accelerometer features by aggregating (concatenating) them into a single feature vector. 10-fold cross-validation was used to train and assess the classifiers’ performance. The reported performance is the weighted average of different metrics using a one-vs-all approach, since this is a multi-class problem. A minimal analogue of these experiments is sketched below.
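The sketch below reproduces the spirit of these experiments in Python with scikit-learn rather than WEKA, so results will not match Tables 2, 3 and 4 exactly; the feature scaling and the weighted F1 metric are our choices for illustration only.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("features.csv")
y = features["label"]
X = features.drop(columns=["userid", "label"])  # experiment 3: all features combined

# KNN with k=3, as in the experiments above.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scores = cross_val_score(knn, X, y, cv=10, scoring="f1_weighted")
print(f"10-fold weighted F1: {scores.mean():.3f}")
```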

Table 2. Classification performance (weighted average) with audio features. The best performing classifier was KNN.
Table 3. Classification performance (weighted average) with accelerometer features. The best performing classifier was KNN.
Table 4. Classification performance (weighted average) when combining all features. The best performing classifier was Multilayer perceptron.

Tables 2, 3 and 4 show the final results. When using only audio features (Table 2), the best performing model was KNN across all performance metrics, with a Matthews correlation coefficient (MCC) of 0.761. We report MCC instead of accuracy because MCC is less sensitive to class imbalance. When using only accelerometer features (Table 3), the best model was again KNN across all performance metrics, with an MCC of 0.790. From these tables, we observe that most classifiers performed better with the accelerometer features, with the exception of Naive Bayes. Next, we trained the models using all features (accelerometer and audio). Table 4 shows the results. In this case, the best model was the multilayer perceptron, followed by KNN. Overall, all models benefited from the combination of features, and some increased their performance by up to \(\approx \)0.15, such as the SVM, which went from an MCC of 0.698 to 0.855.

All in all, combining data sources enhanced performance. Here, we simply aggregated the features from both data sources. However, other techniques can be used, such as late fusion, which consists of training independent models on each data source and then combining their outputs; a sketch is given below. The experiments thus show that machine learning systems can perform this type of automatic activity detection, but also that there is large potential for improvement, where the HTAD dataset can play an important role, not only as an enabling factor but also for reproducibility.
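As one possible direction (not the procedure used in this paper), the following sketch implements a simple late fusion scheme: one model per modality, with the predicted class probabilities averaged. The train/test split and classifier choices are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

features = pd.read_csv("features.csv")
y = features["label"]
X_audio = features[[c for c in features.columns if c.startswith("v1_")]]
X_accel = features[[c for c in features.columns if c.startswith("v2_")]]

idx_train, idx_test = train_test_split(features.index, test_size=0.3,
                                       stratify=y, random_state=0)

# One independent model per modality.
audio_model = LogisticRegression(max_iter=1000).fit(X_audio.loc[idx_train], y.loc[idx_train])
accel_model = KNeighborsClassifier(n_neighbors=3).fit(X_accel.loc[idx_train], y.loc[idx_train])

# Late fusion: average the class probabilities of both models.
proba = (audio_model.predict_proba(X_audio.loc[idx_test])
         + accel_model.predict_proba(X_accel.loc[idx_test])) / 2
pred = audio_model.classes_[np.argmax(proba, axis=1)]
```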

6 Conclusions

Reproducibility and comparability of results are important factors in high-quality research. In this paper, we presented a dataset for activity recognition that supports reproducibility in the field. The dataset was collected using a wrist accelerometer and audio captured with a smartphone. We provided baseline experiments and showed that combining the two sources of information produced better results. Several datasets already exist; however, most of them focus on a single data source and on the traditional walking, jogging, standing, etc. activities. Here, we employed two different sources (accelerometer and audio) for home-task activities. Our vision is that this dataset will allow researchers to test different sensor data fusion methods to improve activity recognition performance in home-task settings.