1 Introduction

The increasing availability of wearable body sensors has led to novel scientific studies and industrial applications [1]. The main areas include gesture recognition, human activity recognition, and human gait analysis. Several databases have been released for benchmarking; however, owing to the wide variety of sensor types and the complexity of the activities, these databases are rather distinct. Below, we review these areas and the corresponding databases in a taxonomic manner.

Gesture recognition (GR) mainly focuses on recognizing hand gestures drawn in the air. Patterns to be recognized may include numbers, circles, boxes, or letters of the Latin alphabet. Prediction is usually made on data obtained from smartphone sensors or from special gloves equipped with kinematic sensors, such as 3-axis accelerometers and 3-axis gyroscopes, and occasionally with electromyography (EMG) sensors, which measure the electrical potential on the human skin during muscular activity [2]. A database for gesture recognition is available in [3].

Human activity recognition (HAR), on the other hand, aims at recognizing daily lifestyle activities. For instance, an interesting research topic is recognizing activities in or around the kitchen, such as cooking; loading the dishwasher or washing machine; preparing brownies or salads; scrambling eggs; light cleaning; opening or closing drawers, the fridge, or doors; and so on. Often these activities can be interrupted by, for example, answering the phone. Databases on this topic include the MIT Place dataset [4, 5], the Darmstadt Daily Routine dataset [6], Ambient Kitchen [7], the CMU Multi-Modal Activity Database (CMU-MMAC) [8], and the Opportunity dataset [9, 10]. In this area, on-body inertial sensors are usually worn on the wrist, back, or ankle; however, additional sensors are also used, such as temperature sensors, proximity sensors, water consumption sensors, heart rate monitors, and so on. For instance, CMU-MMAC includes video, audio, RFID tags, a motion capture system based on on-body markers, and physiological sensors such as galvanic skin response (GSR) and skin temperature sensors, located on both forearms and upper arms, the left and right calves and thighs, the abdomen, and the wrists.

Other types of HAR usually focus on walking-related activities, such as walking, jogging, turning left or right, jumping, lying down, going up or down the stairs, and so on. Data on this topic can be found in the WARD dataset [11], the PAMAP2 dataset [12, 13], the HASC challenge [14, 15, 16], USC-HAD [17, 18], and MAREA [19]. For data collection, on-body sensors are often placed on the participant’s wrist, waist, ankles, and back.

In some databases, exceptional efforts have been made to provide a reliable benchmark. The Body Sensor Networks Conference (BSNC) (http://bsncontest.org) [20], for instance, has carried out a contest in which the organizers provided three datasets collected by different research groups; the datasets differ in the sensor types used and the activities recorded. Another team, Evaluating Ambient Assisted Living Systems through Competitive Benchmarking – Activity Recognition (EvAAL-AR), provides a service to evaluate HAR systems live on the same activity scenarios performed by an actor [21]. In this contest, each team brings its own activity recognition system, and the evaluation criteria attempt to capture practical usability: recognition accuracy, user acceptance, recognition delay, installation complexity, and interoperability with ambient-assisted-living systems.

Gait analysis focuses not only on recognizing the activities observed but also on how the activities are performed. This can be useful in health-care systems, for example, for monitoring patients recovering from surgery, for fall detection, or for diagnosing the state of Parkinson’s disease [22, 23]. For instance, the Daphnet Gait dataset (DG) [24] consists of recordings of 10 participants with Parkinson’s disease who were instructed to carry out activities that are likely to be difficult for them to perform, such as walking. The objective is to detect freezing-of-gait incidents from accelerometer data recorded above the ankle, above the knee, and on the trunk. On the other hand, Bovi et al. provide a reference gait dataset collected from 40 healthy people of various ages [25]. In the aforementioned BSNC, the third database (ID:IC) contains gait data recorded before knee surgery and 1, 3, 6, 12, and 24 weeks after it.

2 Motivation and Design Goals

The main purpose of this dataset is to provide detailed gait data for studying how the parts of the legs move individually and relative to each other during activities such as walking, running, standing up, and so on. A summary of the activities can be found in Table 1. The dataset contains continuous recordings of combinations of activities, and the data are segmented and annotated with the label of the activity currently being performed. Thus, the dataset is also suitable for analyzing human gait and the transitions between activities.

Table 1. Characteristics of HuGaDB

Mainly inertial sensors were used for data acquisition. We chose inertial sensors because they are inexpensive, simple to use anywhere, both indoors and outdoors, and widely available compared with other systems. Video-based motion capture systems, for instance, require expensive video cameras and a special full-body suit with markers on it; in addition, they are restricted to the installed test area, are sensitive to lighting, and suffer from the lost-marker phenomenon.

In total, six inertial sensors were placed on the right and left thighs, shins, and feet, and data were collected from 18 healthy participants, providing a total of 10 h of recording. This allows one to investigate how the parts of the legs move individually and relative to each other within and between activities. Our dataset could serve as control data, for instance, in health-care-related studies such as walking rehabilitation or Parkinson’s disease recognition. In virtual reality or gaming, it can be used to model virtual human movements by reproducing the leg movements from the accelerometer data, essentially by integrating the acceleration signals. In fact, its use is not limited to virtual environments: it could also be used to train humanoid robots to walk and move in a more humanlike manner and thus help overcome the uncanny valley.
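As an illustrative sketch of this idea (with a synthetic signal standing in for real accelerometer data), positions can be approximated by double numerical integration, although in practice integration drift must be handled, for example, by high-pass filtering:

    import numpy as np

    fs = 56.35                         # average sampling rate of the recordings (Hz)
    dt = 1.0 / fs
    t = np.arange(0.0, 2.0, dt)
    acc = np.sin(2 * np.pi * 1.5 * t)  # synthetic acceleration signal (m/s^2)

    vel = np.cumsum(acc) * dt          # first integral: velocity (m/s)
    pos = np.cumsum(vel) * dt          # second integral: position (m)
    # Drift accumulates quickly, so real pipelines typically high-pass filter
    # the signals or re-zero the velocity at known stance phases.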

This dataset is unique in that it is the first to provide such detailed human gait data, recorded mainly from inertial sensors, together with segmented annotations for studying the transitions between different activities.

3 Data Collection and Sensor Network Topology

For data collection, we used MPU9250 inertial sensors and electromyography (EMG) sensors. Each EMG sensor has a voltage gain of about 5000 and a band-pass filter whose bandwidth corresponds to the power spectrum of the EMG signal (10–500 Hz). The sample rate of each EMG channel is 1.0 kHz, the ADC resolution is 8 bits, and the input voltage range is 0–5 V. Each inertial sensor consists of a 3-axis accelerometer and a 3-axis gyroscope integrated into a single chip. Data were collected with the accelerometer’s range set to \({\pm }2\) g with a sensitivity of 16384 LSB/g and the gyroscope’s range set to \(\pm 2000^{\circ }/\)s with a sensitivity of 16.4 LSB per \(^{\circ }/\)s. All sensors are battery powered, which helps minimize electrical grid noise.

Accelerometer and gyroscope signals were stored in int16 format, and EMG signals in uint8. Accelerometer data can therefore be converted to m/s\(^2\) by dividing the raw data by 32768 and multiplying by 2g (g \(\approx \) 9.81 m/s\(^2\)). Raw gyroscope data can be converted to \( ^{\circ }/\)s by multiplying by 2000/32768. Raw EMG data can be converted to volts of skin potential by multiplying by 0.001/255, which accounts for the 5 V ADC range and the gain of 5000 (5/5000 = 0.001). We kept the data raw in our collection in case one prefers other normalization techniques.
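The following sketch illustrates these conversions in Python (the raw example values are arbitrary):

    import numpy as np

    G = 9.81                                 # gravitational acceleration (m/s^2)
    raw_acc = np.array([16384, -16384])      # int16 accelerometer samples
    raw_gyro = np.array([164, -16400])       # int16 gyroscope samples
    raw_emg = np.array([128, 255])           # uint8 EMG samples

    acc_ms2 = raw_acc / 32768.0 * 2 * G      # +-2 g range, 16384 LSB/g
    gyro_dps = raw_gyro * 2000.0 / 32768.0   # +-2000 deg/s range, 16.4 LSB/(deg/s)
    emg_volts = raw_emg * 0.001 / 255.0      # 5 V ADC range divided by the gain of 5000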

In total, three pairs of inertial sensors and one pair of EMG sensors were installed symmetrically on the right and left legs with elastic bands. One pair of inertial sensors was installed on the rectus femoris muscles, 5 cm above the knee; another pair around the middle of the shinbones, at the level where the calf ends; and a third pair on the feet, on the metatarsal bones. The two EMG sensors were placed on the vastus lateralis muscles and connected to the skin with three electrodes. The locations of the sensors are shown in Fig. 1. In total, 38 signals were collected: 36 from the inertial sensors and 2 from the EMG sensors.

The sensors were connected by wires to each other and to a microcontroller box containing an Arduino electronics platform with a Bluetooth module. The microcontroller collected 56.3500 samples per second on average, with a standard deviation (std) of 3.2057, and transmitted them to a laptop over a Bluetooth connection.

The data were collected from 18 participants: healthy young adults (4 females and 14 males) with an average age of 23.67 (std: 3.69) years, an average height of 179.06 (std: 9.85) cm, and an average weight of 73.44 (std: 16.67) kg.

The participants performed combinations of activities at a normal speed and in a casual manner, and no obstacles were placed in their way. For instance, a participant could be instructed to perform the following sequence, starting from a sitting position: sitting - standing up - walking - going up the stairs - walking - sitting down. The experimenter recorded the data continuously on a laptop and annotated them with the activities performed, yielding long, continuous sequences of segmented data annotated with activities. We developed our own data collection program. In total, 2,111,962 samples were collected from the 18 participants, providing a total of 10 h of data.

Data acquisition was carried out mainly inside a building; however, activities such as running, bicycling, and sitting in a car were performed outside. We also collected data in a moving elevator and in a moving vehicle. In these scenarios, the activity performed was simply standing or sitting, but the motion of the elevator or vehicle exerts an additional force on the accelerometers, and in certain applications it may be important to take this into account. Note that we did not collect data on a treadmill.

Fig. 1. Location of the sensors. EMG sensors are shown as circles, while boxes represent the inertial sensors.

4 Data Format

Data obtained from the sensors were stored in flat text files. We decided to store the data in flat files because they are one of the most universal formats and can be easily preprocessed in any programming language on any system. One data file contains one recording, which is either a single activity (e.g., walking) or a series of activities. Every file name was created according to the template HGD_vX_ACT_PR_CNT.txt, where HGD is a prefix meaning human gait data and vX denotes the version of the data files, currently v1. ACT denotes the ID of the activity performed; if a file contains a series of different types of activities, it is indicated as VARIOUS. PR indicates the ID of the person who performed the activity. Data recording was repeated several times, and CNT is a counter for this. For example, a file named HGD_v1_walking_17_02.txt contains data from participant 17 while he was walking, for the second time. The file naming convention is summarized in Table 2.
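For convenience, the convention can be parsed programmatically; a small sketch (the regular expression is ours, not part of the dataset tooling):

    import re

    name = 'HGD_v1_walking_17_02.txt'
    m = re.match(r'HGD_(v\d+)_(.+)_(\d+)_(\d+)\.txt', name)
    version, activity, participant, counter = m.groups()
    # -> ('v1', 'walking', '17', '02')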

Table 2. Description of the file naming convention

The main body of each data file contains tab-delimited raw, unnormalized data obtained directly from the sensors. Each data file starts with a header containing meta-information: the list of activities, the IDs of the activities recorded, and the time and date of the recording. The header is summarized in Table 3.

Table 3. Description of the data file header

The main data body of every file has 39 columns. Each column corresponds to a signal, and each row corresponds to a sample. The order of the columns is fixed: the first 36 columns correspond to the inertial sensors, the next 2 columns to the EMG sensors, and the last column contains the activity ID. The activities are coded as shown in Table 1. The inertial sensors are listed in the following order: right foot (RF), right shin (RS), right thigh (RT), left foot (LF), left shin (LS), and left thigh (LT), followed by the right (R) and left (L) EMG sensors. Each inertial sensor produces three acceleration signals (x, y, and z axes) and three gyroscope signals (x, y, and z axes). For instance, the column named ‘RT_acc_z’ contains data obtained from the z-axis of the accelerometer located on the right thigh.
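Given this fixed ordering, the full column list can be generated programmatically; a sketch (the EMG and activity column labels here are illustrative, not taken from the files):

    sensors = ['RF', 'RS', 'RT', 'LF', 'LS', 'LT']  # right foot/shin/thigh, then left
    channels = ['acc_x', 'acc_y', 'acc_z', 'gyro_x', 'gyro_y', 'gyro_z']
    columns = [f'{s}_{c}' for s in sensors for c in channels]
    columns += ['EMG_R', 'EMG_L', 'act']            # EMG channels and activity ID
    assert len(columns) == 39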

Sample data for each activity are visualized as a heat map in Fig. 2.

Fig. 2. Data visualization. For normalization, data from the inertial sensors were divided by 32768, while 128 was subtracted from the EMG data, which were then divided by 128.

A screenshot of part of a data file is shown in Fig. 3.

Fig. 3. Screenshot of a data file.

The data files can be loaded easily in most popular programming languages. For instance, they can be loaded in Python with a short NumPy-based script along the following lines:

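    import numpy as np

    # Minimal loading sketch: the meta-information header is assumed to span
    # the first four lines; adjust skip_header to the actual header length.
    data = np.genfromtxt('HGD_v1_walking_17_02.txt', delimiter='\t', skip_header=4)
    signals = data[:, :38]      # 36 inertial channels and 2 EMG channels
    activity_ids = data[:, 38]  # activity ID of each sample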

Please note that this requires the NumPy library. The data can also be loaded in Matlab with a one-line command such as:

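    data = dlmread('HGD_v1_walking_17_02.txt', '\t', 4, 0); % skip the header rows; adjust the offset 4 to the actual header length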

We have also prepared a script that loads the data into an SQLite database; it is available at the database’s website: https://github.com/romanchereshnev/HuGaDB/blob/master/Scripts/create_db.py.

Fig. 4. Data variance during walking. (A) Activity performed by the same user multiple times. (B) Activity performed by different users. The legend indicates the source of the data. Data are scaled to the range \([-1,+1]\).

5 Discussion on Data Variance

We were interested in seeing the variance in the collected data, in particular, (A) the variance within a single user and (B) the variance between several users. For this reason, Fig. 4 plots the x-axis acceleration data from the thigh recorded during a short two- to three-step walk. Panel A shows data from several recordings performed by the same user. It can be seen that the variance at any single frame is quite low, suggesting that the same person performs an activity in a very similar way each time. Panel B, on the other hand, shows data obtained from six different, randomly chosen users. Here, a much higher variance can be seen at the same frames. The increased variance may arise from several factors, including differences in gait, differences in leg shape, sensors mounted in slightly different positions, and so on. We reached similar conclusions for data obtained from other sensors during other activities. We note that an even higher variance was observed in the EMG data, resulting from differences in the electrical conduction characteristics of the skin, skin thickness, and so on.
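As a sketch of how this per-frame variability can be quantified (with placeholder data standing in for time-aligned recordings):

    import numpy as np

    # Rows: time-aligned recordings of the same activity; columns: frames.
    recordings = np.random.randn(6, 100)    # placeholder for six recordings
    per_frame_std = recordings.std(axis=0)  # variability at each frame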

Given the high data variance between different users, we emphasize the importance of properly evaluating machine learning methods developed for human activity recognition. We therefore propose using the supervised cross-validation approach for constructing training and test sets [26]: all the data from a designated user are held out for testing only, and the data from the other 17 participants are used for training. This approach provides a reliable estimate of how an activity recognition system would perform on a new user whose data it has not seen before.
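A sketch of this scheme using scikit-learn's LeaveOneGroupOut, where the feature matrix, labels, and classifier are placeholders and groups holds the participant ID of each sample:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut

    X = np.random.randn(180, 38)            # placeholder feature matrix
    y = np.random.randint(0, 4, size=180)   # placeholder activity labels
    groups = np.repeat(np.arange(18), 10)   # participant ID of each sample

    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = RandomForestClassifier().fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    print(np.mean(scores))                  # average held-out-user accuracy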

Variance can also arise from using different brands of sensors. Unfortunately, we did not have the capacity to collect data with sensors from different manufacturers. We hope that the measurement noise is small in general and that different sensors can be calibrated to be compatible with each other.

6 Availability

The database is available free of charge at https://github.com/romanchereshnev/HuGaDB (455 MB).

7 Summary

The HuGaDB dataset contains detailed kinematic data for analyzing human gait and for activity recognition. It differs from previously published datasets in that it provides human gait data in great detail, recorded mainly from inertial sensors, and contains segmented annotations for studying the transitions between different activities. Data were obtained from 18 participants and provide around 10 h of recording in total. The dataset can be used in health-care-related studies, such as walking rehabilitation, or for modeling human movements in virtual reality or humanoid robotics. It will be updated with data from new participants in the future.