
1 Accurate Indoor Localization

Do you know how to accurately obtain your location information under unpredictable changes in environmental conditions? In recent years, with the development of the mobile Internet, location-based services (LBSs) [1] have been widely used in our daily life, expanding from traditional navigation to real-time applications such as shared mobility and social networks. With the development of LBS applications, the location area extends from outdoors to indoors, which creates a great demand for indoor localization with high accuracy. Indoor localization can be implemented in a variety of ways, such as base stations, video, infrared, Bluetooth, and Wi-Fi [2]. Among them, Wi-Fi-based indoor localization has become the most popular approach because of the wide coverage of Wi-Fi access points and the rapid development of intelligent terminals [3–5]. Although research on Wi-Fi-based indoor localization has made great progress, in highly dynamic environments the wireless signal fluctuates heavily due to the multipath effect, environmental changes, and personnel flow. High-accuracy indoor localization still faces the problems of (1) the lack of large-scale labeled data at the data layer, (2) the fluctuation of signal strength at the feature layer, and (3) weak adaptation ability at the model layer, which result in low location accuracy, coarse trajectory granularity, and weak robustness. To address these challenges, this section introduces some accurate indoor localization techniques.

1.1 Context-Adaptive Localization Model

The wireless signal fingerprint-based indoor localization model is essentially a mapping between the high-dimensional signal space and the physical space. For this kind of mapping model, the input \({\mathbf{x}}\) is the feature vector extracted from the wireless signal strength, and the output \({\mathbf{y}}\) is the position coordinate. Training the localization model amounts to optimizing the objective function \(f^{*} = {\text{argmin}}_{f} \sum\nolimits_{i = 1}^{N} \left| {f\left( {x_{i} } \right) - y_{i} } \right|^{2}\) with the given samples \(\left\{ {\left( {x_{i} ,y_{i} } \right)|i = 1, \ldots ,N} \right\}\).
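As a concrete illustration, the sketch below fits such a mapping on synthetic RSS fingerprints, with a k-nearest-neighbor regressor standing in for \(f\); the path-loss signal model, access point layout, and all parameters are illustrative assumptions, not part of the methods discussed later.

```python
# Minimal sketch: learning the fingerprint-to-coordinate mapping f as a regression
# problem, assuming a (hypothetical) set of calibrated samples is already available.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# x_i: RSS feature vectors from 6 access points (dBm), y_i: 2-D coordinates (m).
# Both are synthetic stand-ins for real calibration data.
N, n_aps = 200, 6
Y = rng.uniform(0, 20, size=(N, 2))                  # ground-truth positions
ap_pos = rng.uniform(0, 20, size=(n_aps, 2))         # access point locations
dist = np.linalg.norm(Y[:, None, :] - ap_pos[None, :, :], axis=2)
X = -40 - 20 * np.log10(dist + 1.0) + rng.normal(0, 2, size=(N, n_aps))  # log-distance path loss + noise

# f = argmin_f sum_i |f(x_i) - y_i|^2, approximated here by k-nearest-neighbor regression
f = KNeighborsRegressor(n_neighbors=5, weights="distance")
f.fit(X[:150], Y[:150])

pred = f.predict(X[150:])
err = np.linalg.norm(pred - Y[150:], axis=1)
print(f"mean localization error: {err.mean():.2f} m")
```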

However, for highly dynamic environments, a context-adaptive model is necessary. This adaptive model should include both the minimization of fitting errors and context-adaptive constraints, as shown in Fig. 1.1, where \(\mathrm{fitting\_err}(x, y)\) represents the fitting error between the model's output and the calibration results, and \(g\left( {c_{1} ,c_{2} , \ldots , c_{k} } \right)\) represents the constraints constructed from the multi-source information \(c_{1} ,c_{2} , \ldots , c_{k}\). In addition, these constraints can be constructed flexibly according to the specific scenario's context, including multi-source signals, motion information, and user activities.

Fig. 1.1 Context-adaptive location model for high dynamic environment

Compared with existing methods, the model has three advantages: (1) it gives a unified optimization objective, providing a reference for constructing multi-source information fusion localization methods; (2) it realizes multi-source information fusion at the model level, fully mining the correlation and redundancy among multi-source information; (3) it has more flexible constraints, making the model scalable to various kinds of highly dynamic environments.

1.2 Semi-supervised Localization Model with Signals’ Fusion

Aiming at the problem of low location accuracy caused by the lack of large-scale labeled data, a semi-supervised localization model based on multi-source signal fusion is introduced here. This model combines the fitting error term of the labeled data with manifold constraint terms of the Wi-Fi and Bluetooth signals and optimizes the objective function by adjusting the weight coefficients of all terms. The experimental results [6] showed that the multi-source signal fusion method achieves the best location results when applied to sparsely calibrated localization problems, with higher location accuracy than existing supervised and semi-supervised learning methods.

Unlike previous single-signal-based semi-supervised manifold methods [7–12], it is better to combine the Wi-Fi and BLE signals in a single model. Wi-Fi and BLE signals have different propagation characteristics and effective distances, so when both are considered in a semi-supervised learning model, a separate manifold regularization term should be built for each of them.

In accordance with the structural risk minimization principle [13], FSELM [6] used graph Laplacian regularization to find the structural relationships of both the labeled and unlabeled samples in the high-dimensional signal space. To construct a semi-labeled graph \(G\) from \(l\) labeled samples and \(u\) unlabeled samples, each collected signal vector \(\varvec{s}_{j} = \left[ {s_{j1} ,s_{j2} , \ldots ,s_{jN} } \right] \in R^{N}\) is represented by a vertex \(j\), and if vertex \(j\) is one of the neighbors of vertex \(i\), an edge with weight \(w_{ij}\) is drawn to connect them. According to Belkin et al. [14], the graph Laplacian \(\varvec{L}\) can be expressed as \(\varvec{L} = \varvec{D} - \varvec{W}\). Here, \(\varvec{W} = [w_{ij} ]_{{\left( {l + u} \right) \times \left( {l + u} \right)}}\) is the weight matrix, where \(w_{ij} = \exp \left( { - \left\| {\varvec{s}_{i} - \varvec{s}_{j} } \right\|^{2} /2\sigma^{2} } \right)\) if \(\varvec{s}_{i}\) and \(\varvec{s}_{j}\) are neighbors along the manifold and \(w_{ij} = 0\) otherwise, and \(\varvec{D}\) is a diagonal matrix given by \(\varvec{D}_{ii} = \sum\nolimits_{j = 1}^{l + u} {\varvec{W}_{ij} }\). As illustrated in Fig. 1.2, to account for the empirical risk while controlling model complexity, FSELM minimizes the fitting error plus two separate smoothness penalties for Wi-Fi and BLE, as in (1.1):

Fig. 1.2 FSELM model illustration (L1 and L2 are the graph Laplacians of the Wi-Fi and BLE signals, and λ1 and λ2 are the weight coefficients of the two manifold constraints)

$$\mathop {\text{argmin}}\limits_{\varvec{f}} \left\{ {\frac{1}{2}\left\| {\varvec{f} - {\mathbf{T}}} \right\|^{2} + \lambda_{1} \varvec{f}^{\text{T}} \varvec{L}_{1} \varvec{f} + \lambda_{2} \varvec{f}^{\text{T}} \varvec{L}_{2} \varvec{f}} \right\}$$
(1.1)

The first term represents the empirical error with respect to the labeled training samples. The second and third terms represent the manifold constraints for Wi-Fi and BLE based on the graph Laplacians \(\varvec{L}_{1}\) and \(\varvec{L}_{2}\). By adjusting the two coefficients \(\lambda_{1}\) and \(\lambda_{2}\), the relative influences of the Wi-Fi and BLE signals on the model can be controlled.
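The following sketch illustrates the fused manifold constraint of (1.1) in a simplified form: rather than the full FSELM (an extreme learning machine), it solves a graph-regularized least-squares problem for the location estimates directly, with separate Laplacians built from Wi-Fi and BLE vectors. The neighborhood size, kernel width, weights, and synthetic data are assumptions for illustration only.

```python
# Simplified sketch of the fused manifold regularization in (1.1): instead of the full
# FSELM, we directly solve a graph-regularized least-squares for the location matrix F.
# All data and parameters (sigma, lam1, lam2, k neighbors) are illustrative.
import numpy as np

def graph_laplacian(S, k=5, sigma=5.0):
    """Heat-kernel weight matrix over k-nearest neighbors and its Laplacian L = D - W."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    idx = np.argsort(d2, axis=1)[:, k + 1:]      # drop all but the k nearest neighbors
    for i, cols in enumerate(idx):
        W[i, cols] = 0.0
    W = np.maximum(W, W.T)                        # symmetrize the neighborhood graph
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(1)) - W

def fused_semi_supervised_fit(S_wifi, S_ble, T, labeled_mask, lam1=0.1, lam2=0.1):
    """Minimize 0.5*||J(F - T)||^2 + lam1*tr(F'L1F) + lam2*tr(F'L2F) in closed form."""
    L1 = graph_laplacian(S_wifi)
    L2 = graph_laplacian(S_ble)
    J = np.diag(labeled_mask.astype(float))
    A = J + 2 * lam1 * L1 + 2 * lam2 * L2 + 1e-6 * np.eye(len(T))  # small ridge for stability
    return np.linalg.solve(A, J @ T)              # F: estimated coordinates for all samples

# toy usage: 10 labeled + 40 unlabeled samples with 6 Wi-Fi and 4 BLE channels
rng = np.random.default_rng(1)
S_wifi, S_ble = rng.normal(size=(50, 6)), rng.normal(size=(50, 4))
T = np.zeros((50, 2))
labeled = np.zeros(50, dtype=bool)
labeled[:10] = True
T[:10] = rng.uniform(0, 20, size=(10, 2))
F = fused_semi_supervised_fit(S_wifi, S_ble, T, labeled)
print(F.shape)  # (50, 2)
```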

When applied to sparsely calibrated localization problems, FSELM is advantageous in three aspects. Firstly, its semi-supervised learning framework dramatically reduces the human calibration effort required. Secondly, it uses fused Wi-Fi and BLE fingerprints to markedly improve location accuracy. Thirdly, it inherits the beneficial properties of ELMs in terms of training and testing speed, because the input weights and biases of the hidden nodes can be generated randomly. The findings indicate that effective multi-data fusion can be achieved not only through data-layer, feature-layer, and decision-layer fusion but also through the fusion of constraints within a model. In addition, for semi-supervised learning problems, it is necessary to combine the advantages of different types of data by optimizing the model's parameters. These two contributions will be valuable for solving other similar problems in the future.

1.3 Motion Information Constrained Localization Model

For Wi-Fi fingerprint-based indoor localization, the basic approach is to fingerprint locations of interest with vectors of RSS from the access points during the offline phase and then locate mobile devices by matching the observed RSS readings against this database during the online phase. In this way, continuous localization can be summarized as finding a smooth trajectory passing through all labeled points. Thus, in order to recover the trajectory, a certain amount of labeled data is still needed, especially at some important positions (e.g., corners).

Consider a user who holds a mobile phone and walks in an indoor wireless environment with \(n\) Wi-Fi access points inside. At some time \(t\), the signal received from all \(n\) access points is measured by the mobile device to form a signal vector \(\varvec{s}_{t} = \left[ {s_{t1} ,s_{t2} , \ldots ,s_{tn} } \right] \in R^{n}\). As time goes on, the signal vectors arrive in a streaming manner. After a period of time, a sequence of \(m\) vectors is obtained from the mobile phone, forming an \(m \times n\) matrix \(S = \left[ {\varvec{s}_{1}^{\text{T}} ,\varvec{s}_{2}^{\text{T}} , \ldots ,\varvec{s}_{m}^{\text{T}} } \right]\), where '\({\text{T}}\)' indicates matrix transposition. Along the user's trajectory, only some places are known and labeled, and the rest are unknown. The purpose is to generate and update the trajectory points, which form an \(m \times 2\) matrix \(P = \left[ {\varvec{p}_{1}^{\text{T}} ,\varvec{p}_{2}^{\text{T}} , \ldots ,\varvec{p}_{m}^{\text{T}} } \right]\), where \(\varvec{p}_{t} = \left[ {x_{t} \, y_{t} } \right]^{\text{T}}\) is the location of the mobile device at time \(t\). Meanwhile, the user's heading orientation can also be obtained from the mobile device at each time \(t\). Thus, while collecting the RSS, another vector of \(m\) orientation values can be generated: \(O = \left[ {o_{1} , \ldots ,o_{t} , \ldots ,o_{m} } \right]^{\text{T}}\). Here, \(o_{t}\) indicates the angle to north in the horizontal plane, which is called the azimuth. With the Wi-Fi signal matrix and the orientation vector, the mapping function should be \(f\left( {S,O} \right) = P\). In this way, locations for the unlabeled data can be supplemented, reducing the calibration work.

The fusion mapping model \(f\left( {S,O} \right) = P\) from the signal space to the physical space can be optimized by \(f^{*} = {\text{argmin}}_{f} \sum\nolimits_{i = 1}^{l} \left| {f_{i} - y_{i} } \right|^{2} + \delta \sum\nolimits_{i = 1}^{l} \left| {o_{{f_{i} }} - o_{{y_{i} }} } \right|^{2} + \gamma f^{\text{T}} Lf\), where the first term measures the fitting error to the labeled points, the second term is the fitting error with respect to the user heading orientation offered by the mobile phone, and the third term is the manifold graph Laplacian regularizer.
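A rough sketch of this objective is given below, treating the trajectory \(P\) itself as the free variable and minimizing the three terms numerically with a general-purpose optimizer; the chain-graph Laplacian, weights \(\delta\) and \(\gamma\), and toy measurements are illustrative assumptions rather than the implementation of [15].

```python
# Illustrative sketch of the motion-constrained objective: the trajectory P is optimized
# to fit the labeled points, match the phone-reported heading (azimuth), and stay smooth
# on the signal manifold (graph Laplacian L). Data, weights, and names are assumptions.
import numpy as np
from scipy.optimize import minimize

def heading(P):
    """Azimuth (radians, measured from the +y axis) of each step of trajectory P."""
    d = np.diff(P, axis=0)
    return np.arctan2(d[:, 0], d[:, 1])

def objective(p_flat, L, labeled_idx, labeled_pos, o_meas, delta=1.0, gamma=0.1):
    P = p_flat.reshape(-1, 2)
    fit = np.sum((P[labeled_idx] - labeled_pos) ** 2)            # label fitting error
    ang = heading(P) - o_meas[:-1]                                # heading mismatch
    orient = np.sum(np.arctan2(np.sin(ang), np.cos(ang)) ** 2)    # wrap to (-pi, pi]
    manifold = np.trace(P.T @ L @ P)                              # graph smoothness
    return fit + delta * orient + gamma * manifold

# toy usage with m = 30 trajectory points and a chain-graph Laplacian standing in for
# the signal-space neighborhood graph
m = 30
W = np.zeros((m, m))
for i in range(m - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(1)) - W
labeled_idx = np.array([0, 10, 20, 29])
labeled_pos = np.array([[0, 0], [5, 0], [5, 5], [0, 5]], dtype=float)
o_meas = np.full(m, np.pi / 2)                  # pretend the phone reports "east" all along
P0 = np.zeros(2 * m)
res = minimize(objective, P0, args=(L, labeled_idx, labeled_pos, o_meas), method="L-BFGS-B")
P_hat = res.x.reshape(-1, 2)
print(res.success, P_hat[:3])
```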

This model brings good performance for both tracking mobile nodes and reducing manual calibration in wireless sensor networks. It is based on two observations: (1) similar signals from access points imply close locations; (2) both the labeled data positions and the real-time orientations help in tracking the traces. Thus, a mapping function between the signal space and the physical space is learned by combining a few labeled data, a large amount of unlabeled data, and the orientation constraint obtained from mobile devices.

The experimental results [15] showed that this method can achieve higher tracking accuracy with much less calibration effort, and that it is robust to reductions in the amount of calibrated data. Furthermore, when applied to offline calibration, the online localization is much better than that of previous methods. Moreover, it can reduce time consumption by parallel processing while maintaining trajectory learning accuracy.

2 Fine-Grained Activity Recognition

Traditional activity recognition methods aim at discovering pre-defined activities with body-attached sensors such as accelerometers and gyroscopes. However, people's activities are so diverse that they cannot be covered by a set of pre-defined activities. Moreover, the way the devices are worn, the locations where they are placed, and the person who wears them all lead to decreased recognition accuracy, and a large amount of labeled data is needed to maintain recognition performance. In this section, we will show methods including transfer learning, generative adversarial networks (GANs), and incremental learning that implement fine-grained activity recognition with less human labor.

2.1 Transfer Learning-Based Activity Recognition

The combination of sensor signals from different body positions can be used to reflect meaningful knowledge such as a person's detailed health condition [16] and working state [17]. However, it is nontrivial to design wearing styles for a wearable device. On the one hand, it is not comfortable to equip all body positions with sensors, which restricts the activities; therefore, we can only attach sensors to a limited number of body positions. On the other hand, it is impossible to perform HAR if the labels on some body parts are missing, since the activity patterns of specific body positions are significant for capturing certain information.

Assume a person is suffering from small vessel disease (SVD) [18], a severe brain disease heavily related to activities. However, we cannot equip his whole body with sensors to acquire the labels, since this would make his activities unnatural; in reality we can only label the activities on certain body parts. If the doctor wants to see his activity information on the arm (we call it the target domain), which only contains sensor readings instead of labels, how can the information on other parts (such as the torso or leg, which we call the source domains) be utilized to help obtain the labels on the target domain? This is referred to as cross-position activity recognition (CPAR).

To tackle the above challenge, several transfer learning methods have been proposed [19]. The key is to learn and reduce the distribution divergence (distance) between two domains. With the distance, we can perform source domain selection as well as knowledge transfer. Based on this principle, existing methods can be summarized into two categories: exploiting the correlations between features [20, 21], or transforming both the source and the target domains into a new shared feature space [22–24].

Existing approaches tend to reduce the global distance by projecting all samples from both domains into a single subspace. However, they fail to consider the local property within classes [25]. The global distance may lose domain-local properties such as the source label information and the similarities within the same class, which in turn negatively affects source selection as well as the transfer learning process. It is therefore necessary to exploit the local property of classes to overcome the limitation of global distance learning.

This chapter introduces a Stratified Transfer Learning (STL) framework [26] to tackle the challenges of both source domain selection and knowledge transfer in CPAR. The term 'stratified' comes from the notion of splitting at different levels and then combining. STL adopts the well-established assumption that data samples within the same class should lie on the same subspace, even if they come from different domains [27]. Thus, 'stratified' refers to the procedure of transforming features into distinct subspaces. This motivates the concept of stratified distance (SD) in comparison to the traditional global distance (GD). STL has four steps:

  1.

    Majority Voting: STL uses the majority voting technique to exploit knowledge from the crowd [28]. The idea is that a single classifier may be unreliable, so several different classifiers are assembled to obtain more reliable pseudo labels. To this end, STL makes use of some base classifiers learned from the source domain to collaboratively learn the labels for the target domain (a rough sketch follows this list).

  2.

    Intra-class Transfer: In this step, STL exploits the local property of the domains to further transform each class of the source and target domains into the same subspace. Since the properties within each class are more similar, the intra-class transfer technique guarantees that the transformed domains have the minimal distance. Initially, the source and target domains are divided into C groups according to their (pseudo) labels, where C is the total number of classes. Then, feature transformation is performed within each class of both domains. Finally, the results from the distinct subspaces are merged.

  3.

    Stratified Domain Selection: A greedy technique is adopted in STL-SDS. The body part most similar to the target is the one with the most similar structure and body function, so STL uses distance to reflect this similarity. It calculates the stratified distance between each source domain and the target domain and selects the source with the minimal distance.

  4.

    Stratified Activity Transfer: After source domain selection, the body part most similar to the target domain is obtained. The next step is to design an accurate transfer learning algorithm to perform activity transfer. This chapter introduces a Stratified Activity Transfer (STL-SAT) method for activity recognition. STL-SAT is also based on the stratified distance, and it can simultaneously transform the individual classes of the source and target domains into the same subspaces by exploiting the local property of the domains. After feature learning, STL can learn the labels for the candidates. Finally, STL-SAT performs a second annotation to obtain the labels for the residuals.
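To make the first two steps more concrete, the sketch below pseudo-labels a target body position by majority voting over a few off-the-shelf base classifiers and then applies a crude per-class mean alignment as a stand-in for the intra-class subspace transformation; it is not the original STL implementation, and the data and models are illustrative.

```python
# Rough sketch of the first two STL steps with off-the-shelf classifiers: majority-voting
# pseudo labels for the target body position, then a simple per-class mean alignment as a
# stand-in for the intra-class transformation. Data and models are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def majority_vote_pseudo_labels(Xs, ys, Xt):
    """Step 1: several base classifiers trained on the source vote on target labels."""
    base = [RandomForestClassifier(n_estimators=50, random_state=0),
            KNeighborsClassifier(n_neighbors=5),
            SVC(kernel="rbf", gamma="scale")]
    votes = np.stack([clf.fit(Xs, ys).predict(Xt) for clf in base])   # (3, n_target)
    # per-sample majority label (labels are assumed to be non-negative integers)
    return np.array([np.bincount(col).argmax() for col in votes.T])

def intra_class_align(Xs, ys, Xt, yt_pseudo):
    """Step 2 (simplified): shift each pseudo-labeled target class onto the source class mean."""
    Xt_new = Xt.copy()
    for c in np.unique(ys):
        s_mask, t_mask = ys == c, yt_pseudo == c
        if t_mask.any():
            Xt_new[t_mask] += Xs[s_mask].mean(0) - Xt[t_mask].mean(0)
    return Xt_new

# toy usage with synthetic accelerometer features from a "torso" source and "arm" target
rng = np.random.default_rng(2)
Xs, ys = rng.normal(size=(300, 12)), rng.integers(0, 4, size=300)
Xt = rng.normal(loc=0.5, size=(200, 12))          # shifted target distribution
yt_pseudo = majority_vote_pseudo_labels(Xs, ys, Xt)
Xt_aligned = intra_class_align(Xs, ys, Xt, yt_pseudo)
print(yt_pseudo[:10], Xt_aligned.shape)
```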

2.2 GAN-Based Activity Recognition

Transfer learning methods are effective ways to label practically unknown data, but they are incapable of generating realistic data. Fortunately, the GAN framework is an effective way to generate labeled data from a random noise space.

The vanilla GAN framework was first proposed in 2014 by Goodfellow et al. [29]. Since then, it has been widely researched in many fields, such as image generation [29], image inpainting [30], image translation [31], super-resolution [32], image de-occlusion [33], natural language generation [34], and text generation [35]. In particular, a great many variants of GANs have been explored to generate images with high fidelity, such as NVIDIA's progressive GAN [36] and Google DeepMind's BigGAN [37]. These variants provide powerful methods for training effective generative models that can output very convincing, verisimilar images.

The original GAN framework is composed of a generative multilayer perceptron network and a corresponding discriminative multilayer perceptron network. The final goal of GANs is to estimate an optimal generator that can capture the distribution of real data with the adversarial assistance of a paired discriminator, based on min-max game theory. The discriminator is optimized to distinguish the data distributions of authentic samples and spurious samples from its mutualistic generator. The generator and the discriminator are trained adversarially to achieve their joint optimization.

The generator is optimized by solving the formulation stated in (1.2):

$$\mathop {\hbox{min} }\limits_{G} V_{G} \left( {D,G} \right) = \mathop {\hbox{min} }\limits_{G} \left( {{\mathbb{E}}_{{z \sim p_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right]} \right)$$
(1.2)

The discriminator is optimized by solving the formulation stated in (1.3):

$$\mathop {\hbox{max} }\limits_{D} V_{D} \left( {D,G} \right) = \mathop {\hbox{max} }\limits_{D} \left( {{\mathbb{E}}_{{x \sim p_{data} \left( x \right)}} \left[ {\log D\left( x \right)} \right] + {\mathbb{E}}_{{z \sim p_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right]} \right)$$
(1.3)

The final value function of the min-max game between the generator and the discriminator can be formulated as (1.4):

$$\mathop {\hbox{min} }\limits_{G} \mathop {\hbox{max} }\limits_{D} V\left( {D,G} \right) = {\mathbb{E}}_{{x \sim p_{data} \left( x \right)}} \left[ {\log D\left( x \right)} \right] + {\mathbb{E}}_{{z \sim p_{z} \left( z \right)}} \left[ {\log \left( {1 - D\left( {G\left( z \right)} \right)} \right)} \right]$$
(1.4)
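A minimal sketch of this adversarial optimization for fixed-length 1-D sensor windows is shown below; the MLP architectures, window length, and synthetic "walking-like" signal are assumptions for illustration and do not reproduce the SensoryGANs or SenseGAN designs discussed later.

```python
# Minimal sketch of the min-max optimization in (1.2)-(1.4) with small MLPs, applied to
# fixed-length 1-D sensor windows. Architectures, window length, and training data are
# illustrative, not the SensoryGANs or SenseGAN implementations.
import math
import torch
import torch.nn as nn

WIN, Z_DIM = 128, 16                      # window length, noise dimension (assumptions)

G = nn.Sequential(nn.Linear(Z_DIM, 64), nn.ReLU(), nn.Linear(64, WIN))
D = nn.Sequential(nn.Linear(WIN, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_batch(n=32):
    """Stand-in for real accelerometer windows: noisy sinusoids of a 'walking' rhythm."""
    t = torch.linspace(0, 4 * math.pi, WIN)
    return torch.sin(t) + 0.1 * torch.randn(n, WIN)

for step in range(1000):
    x = real_batch()
    z = torch.randn(x.size(0), Z_DIM)

    # (1.3): maximize E[log D(x)] + E[log(1 - D(G(z)))]  (minimized here as the BCE equivalent)
    opt_d.zero_grad()
    d_loss = bce(D(x), torch.ones(x.size(0), 1)) + \
             bce(D(G(z).detach()), torch.zeros(x.size(0), 1))
    d_loss.backward()
    opt_d.step()

    # (1.2): minimize E[log(1 - D(G(z)))]; the common non-saturating form maximizes log D(G(z))
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(x.size(0), 1))
    g_loss.backward()
    opt_g.step()

print(f"final d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
```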

The original GAN framework was first proposed to generate plausible fake images approximating real images at low resolution, on datasets such as MNIST, TFD, and CIFAR-10. Many straightforward extensions of GANs have since been demonstrated, making this one of the most promising research directions. Although research on GANs has achieved great success in generating realistic-looking images, the GAN framework has not been widely exploited for generating sensor data.

Inspired by the idea of GANs, Alzantot et al. [38] first tried to train an LSTM-based generator to produce sensor data, but their SenseGen is an incomplete GAN framework: the generator and the discriminator in SenseGen are trained separately; that is, the training of the generator is not driven by the gradient back-propagated from the discriminator.

In order to improve the performance of human activity recognition when only a small amount of sensor data is available, under special practical scenarios and resource-limited environments, Wang et al. [39] proposed the SensoryGANs models. To the best of our knowledge, SensoryGANs are the first complete generative adversarial networks applied to generating sensor data in the HAR research field. Specific GAN models were designed for three human daily activities, respectively. The generators accept Gaussian random noise and output accelerometer data of the target human activity. The discriminators accept both the real accelerometer data and the spurious accelerometer data from the generators and output the probability that the input samples come from the real distribution. With the improvement of SensoryGANs, research on human activity recognition, especially in resource-constrained environments, will be greatly encouraged.

Then, Yao et al. [40] proposed SenseGAN to leverage abundant unlabeled sensing data, thereby minimizing the need for labeling effort. SenseGAN jointly trains three components: the generator, the discriminator, and a classifier. The adversarial game among the three modules drives them to their optimal performance. The generator receives random noise and labels and outputs spurious sensing data. The classifier accepts sensing data and outputs labels. Samples from the classifier and the generator are both fed to the discriminator, which differentiates the joint data/label distributions of real and spurious sensing data. Compared with supervised counterparts as well as other supervised and semi-supervised baselines, SenseGAN achieves substantial improvements in accuracy and F1 score. With only 10% of the originally labeled data, SenseGAN can attain nearly the same accuracy as a deep learning classifier trained on the fully labeled dataset.

2.3 Incremental Learning-Based Activity Recognition

With more labeled data, it becomes possible to recognize fine-grained activities. However, traditional sensor-based activity recognition methods train fixed classification models with labeled data collected offline, which are unable to adapt to dynamic changes in real applications. With the emergence of new wearable devices, more diverse sensors can be used to improve the performance of activity recognition. However, it is difficult to integrate a new sensor into a pre-trained activity recognition model: the emergence of new sensors leads to a corresponding increase in the feature dimensionality of the input data, which may cause a pre-trained activity recognition model to fail. The pre-trained model is thus unable to take advantage of this new source of data.

To take advantage of data generated by new sensors, feature incremental learning is an effective approach. To improve the performance of indoor localization with more sensors, Jiang et al. [41] proposed a novel feature incremental and decremental learning method, namely FA-OSELM, which is able to adapt flexibly to dynamic changes of sensors. However, the performance of FA-OSELM fluctuates heavily. Hou and Zhou [42] proposed the One-Pass Incremental and Decremental learning approach (OPID), which is able to adapt to evolving features and instances simultaneously. Xing et al. [43] proposed a perception evolution network that integrates new sensor readings into the learned model; however, the impact of the sensor order is not considered.

Hu et al. [44] proposed a novel feature incremental activity recognition method, which is named Feature Incremental Random Forest (FIRF). It is able to adapt an existing activity recognition model to newly available sensors in a dynamic environment. Figure 1.3 shows an overview of the method.

Fig. 1.3 Feature incremental random forest algorithm

In FIRF, there are two new strategies: (1) MIDGS, which encourages diversity among individual decision trees in the incremental learning phase by identifying the individual learners that have high redundancy with the other learners and low recognition accuracy, and (2) FITGM, which improves the performance of these identified decision trees with new data collected from both existing and newly emerging sensors.

In real applications, people may learn new motion activities over time, which is usually classified as a dynamic change in class. When a new kind of activity is performed or the behavioral pattern changes over time, devices with preinstalled activity recognition models may fail to recognize the new activities, or even known activities performed in a changed manner. To adapt to such changes, traditional batch learning methods require retraining the whole model from scratch, which wastes a great deal of time and memory.

Class incremental learning is an effective way to address this problem. Different from batch learning, incremental learning (or online learning) methods update existing models with new knowledge. In [45], Zhao et al. presented a class incremental extreme learning machine (CIELM), which adds new output nodes to accommodate data of new classes; by updating the output weights, CIELM can recognize new activities dynamically. Camoriano et al. [46] employed a recursive technique and regularized least squares for classification (RLSC) to seamlessly add new classes to the learned model, also considering the imbalance between classes in the class incremental learning phase. Zhu et al. [47] introduced the one-pass class incremental learning (OPCIL) framework to handle newly emerging classes, proposing a pseudo-instance generation approach to address the new class adaptation issue. Ristin et al. [48] put forward two variants of random forests to incorporate new classes, devising four incremental learning strategies that exploit the hierarchical nature of random forests for efficient updating.
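The sketch below illustrates the class-incremental idea behind such ELM-based methods: the random hidden layer is kept fixed, and a new output node is accommodated by re-solving the output weights when data of a new class arrives. It is a simplified stand-in for CIELM's recursive update, and all sizes and data are illustrative.

```python
# Illustrative sketch of the class-incremental idea behind CIELM: an ELM-style classifier
# with a fixed random hidden layer gains a new output column when a new activity class
# appears. Output weights are re-solved here instead of using the exact recursive update.
import numpy as np

class IncrementalELM:
    def __init__(self, in_dim, hidden=100, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(in_dim, hidden))   # random input weights (never retrained)
        self.b = rng.normal(size=hidden)              # random biases
        self.beta = None                              # output weights, shape (hidden, n_classes)
        self.n_classes = 0

    def _h(self, X):
        return np.tanh(X @ self.W + self.b)           # hidden-layer activations

    def fit(self, X, y, n_classes):
        self.n_classes = n_classes
        T = np.eye(n_classes)[y]                      # one-hot targets
        self.beta = np.linalg.pinv(self._h(X)) @ T    # least-squares output weights

    def add_class(self, X_new, y_new):
        """Append an output node for the new class and re-fit the output weights."""
        self.n_classes = max(self.n_classes, int(y_new.max()) + 1)
        T = np.eye(self.n_classes)[y_new]
        self.beta = np.linalg.pinv(self._h(X_new)) @ T

    def predict(self, X):
        return (self._h(X) @ self.beta).argmax(1)

# toy usage: train on 3 activity classes, later add a 4th without touching W and b
rng = np.random.default_rng(3)
X0, y0 = rng.normal(size=(300, 20)), rng.integers(0, 3, size=300)
elm = IncrementalELM(in_dim=20)
elm.fit(X0, y0, n_classes=3)
X1, y1 = rng.normal(size=(200, 20)), rng.integers(0, 4, size=200)   # now includes class 3
elm.add_class(X1, y1)
print(elm.predict(X1[:5]))
```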

In [49], Hu et al. proposed an effective class incremental learning method, named class incremental random forest (CIRF), to enable existing activity recognition models to identify new activities. They designed a separating axis theorem-based splitting strategy to insert internal nodes and adopted the Gini index or information gain to split leaves of the decision trees in the random forest. With these two strategies, both inserting new nodes and splitting leaves are allowed in the incremental learning phase. They evaluated the method on three UCI public activity datasets and compared it with other state-of-the-art methods. Experimental results show that this incremental learning method converges to the performance of batch learning methods (random forests and extremely randomized trees) and is able to recognize data of new classes continuously with better performance than other state-of-the-art methods.

3 Pervasive Context Sensing

With the pervasiveness of intelligent hardware, more individual context can be sensed, which is meaningful for inferring users' life patterns, health conditions, etc. In this section, we will introduce context sensing methods with pervasive intelligent hardware, including sleep sensing, household water-usage sensing, etc.

3.1 Sleeping Sensing

Sleeping is a vital activity on which people spend nearly a third of their lifetime. Many studies have shown that sleep disorders are related to many serious diseases, including senile dementia, obesity, and cardiovascular disease [50]. Clinical studies have reported that sleep is composed of two stages: rapid eye movement (REM) and non-rapid eye movement (NREM). NREM can be further divided into light and deep sleep stages. During sleep, REM and NREM alternate. Babies can spend up to 50% of their sleep in the REM stage, compared to only about 20% for adults. As people get older, they sleep more lightly and get less deep sleep. Therefore, it is meaningful to find out the distribution of different sleep stages.

As sleep quality is very important for health, much previous research has been done on sleep detection. Methods for analyzing sleep quality mainly monitor the different sleep stages. Currently, the technologies for recording sleep stages fall into two categories. One category is polysomnography (PSG)-based approaches [51]. PSG monitors many body functions during sleep, including brain activity (EEG), eye movements (EOG), skeletal muscle activation (EMG), and heart rhythm (ECG). However, collecting polysomnography signals or brain waves requires professional equipment and specialized knowledge. The other category is actigraphy-based approaches, whose typical devices can be divided into two groups. The first group is wearable sleep and fitness trackers such as the Fitbit Charge 2 and Jawbone Up [52]. These devices primarily work by actigraphy. Several algorithms [53] utilize wrist activity data to predict sleep/wake states, and the results have shown that the accuracy of predicting sleep/wake from wrist activity data approaches that obtained using EEG data. But wearable sleep devices have weaknesses because of accuracy concerns for sleep stages: they detect sleep stages based on logged acceleration data generated by body movement, which means that if a user does not move, these devices have to rely on other auxiliary sensors. The second group is non-wearable sleep trackers such as the Beddit 3.0 Smart Sleep Monitor. These are dedicated sleep trackers that users do not wear on the wrist, and they tend to provide more detailed sleep data. Many products use non-wearable smartphone sensors to assess sleep quality or sleep stages. An application called iSleep [54] used the microphone of a smartphone to detect sleep events; the method extracts three features to classify different events, including body movement, snoring, and coughing. These non-wearable sleep trackers tend to use many smartphone sensors and much manual intervention to extract features.

Different from these works, the work in [55] leveraged only the microphone, without any other auxiliary sensor or much manual intervention, to detect sleep stages. The acoustic signal collected by the microphone is sensitive enough to record the relevant information. After the acoustic signal is collected, its spectrogram visual representation is computed. Specifically, the spectrogram is the magnitude squared of the short-time Fourier transform (STFT): the time signal is split into short segments of equal length, and the STFT is computed on each segment.
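A minimal sketch of this preprocessing step is shown below, assuming an 8 kHz microphone recording and a 64 ms segment length; the synthetic audio is only a stand-in for real overnight recordings.

```python
# Minimal sketch of turning a microphone recording into the spectrogram fed to the model:
# magnitude-squared STFT over short, equal-length segments. The sampling rate, segment
# length, and the synthetic signal are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                         # assumed microphone sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)                      # 10 s of audio
audio = 0.5 * np.sin(2 * np.pi * 120 * t) * (np.sin(2 * np.pi * 0.3 * t) > 0) \
        + 0.01 * np.random.randn(t.size)          # bursts of a 120 Hz tone + noise

# |STFT|^2 on 64 ms segments with 50% overlap
f, seg_t, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256, mode="magnitude")
Sxx = Sxx ** 2                                    # magnitude squared of the STFT
log_spec = 10 * np.log10(Sxx + 1e-10)             # log scale is friendlier for a CNN
print(f.shape, seg_t.shape, log_spec.shape)       # (257,), (n_segments,), (257, n_segments)
```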

Once the spectrograms have been computed, they can be processed by a deep learning model. Deep learning is a relatively new area of machine learning research whose algorithms build a large stack of layers to extract a hierarchy of features from low level to high level. Deep learning models include the deep neural network (DNN), the convolutional neural network (CNN, or ConvNet), etc. The ConvNet [56] is the most efficient approach for image and speech recognition. The major difference between ConvNets and ordinary neural networks is that ConvNet architectures make the explicit assumption that the inputs are images, which allows certain properties to be encoded into the architecture and vastly reduces the number of parameters in the network.

The convolutional neural network architecture and training procedure are shown in Fig. 1.4. Borrowing from network configurations that work well in image recognition, this configuration improves the expressive ability of the ConvNet. At the same time, stacking convolutional layers and pooling layers captures the long-range dependence (LRD) of the acoustic signal, which is more robust than a conventional ConvNet architecture.

Fig. 1.4 Illustration of the convolutional neural networks (CNNs) architecture and training procedure

During the training process, the goal is to minimize the loss function through backward propagation. Optimizers such as stochastic gradient descent (SGD) and Nadam are used to update the weights of the hidden layers. The output of the network is divided into three categories: deep sleep, light sleep, and REM.
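For illustration, a small ConvNet of this kind and one training step might look like the sketch below; the layer sizes, input patch shape, and optimizer settings are assumptions rather than the exact configuration used in [55].

```python
# Illustrative sketch of a small ConvNet that maps log-spectrogram patches to the three
# sleep-stage classes (deep, light, REM). Layer sizes, input shape, and the optimizer
# choice are assumptions, not the architecture used in the cited work.
import torch
import torch.nn as nn

N_FREQ, N_TIME, N_CLASSES = 128, 64, 3     # assumed spectrogram patch size

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * (N_FREQ // 4) * (N_TIME // 4), 64), nn.ReLU(),
    nn.Linear(64, N_CLASSES),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # Nadam also works

# toy training step on random stand-in spectrogram patches
x = torch.randn(8, 1, N_FREQ, N_TIME)      # batch of 8 log-spectrogram patches
y = torch.randint(0, N_CLASSES, (8,))      # stand-in sleep-stage labels
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.3f}")
```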

3.2 Water-Usage Sensing

A person's daily activities can be recognized by monitoring the infrastructures (e.g., water, electricity, heating, ventilation, air conditioning) in the house. Infrastructure-mediated sensing has been recognized as a low-cost and nonintrusive activity recognition technique.

Several infrastructure-mediated sensing approaches for water-usage activity recognition have been proposed recently. A water-usage activity recognition technique was proposed by Fogarty et al. [57], deploying four microphones on the surface of water pipes near the inlets and outlets. Froehlich et al. [58] proposed HydroSense, another infrastructure-mediated single-point sensing technique. Thomaz et al. [59] proposed a learning approach for high-level activity recognition, which combined single-point infrastructure-mediated sensing with a vector space model; their work is considered the first to employ this method for inferring high-level water-usage activities. However, the infrastructure of the house has to be remodeled to accommodate the installation of the pressure sensors.

To solve the above problem, a single-point infrastructure-mediated sensing system for water-usage activity recognition proved to be effective, with a single 3-axis accelerometer attached to the surface of the main water pipe in the house [60]. The structure of the water pipe in the apartment can be seen in Fig. 1.5. The thick and thin gray lines represent the main water pipe and the branches of the main pipe, and the green circle and red star in Fig. 1.5 are the water meter and the accelerometer, respectively.

Fig. 1.5 Structure of the water pipe

The water-usage activity recognition system has six modules, which are:

  A.

    Data Preprocessing

    Normally, there is some noise in the raw time series samples which should be filtered out. The median filter technique is employed in this data preprocessing module, and the filter window is set to 3.

  B.

    Segmentation

    The segmentation module is aimed at segmenting both the rugged segments (time series of rugged samples) and the smooth segments (time series of smooth samples) from the time series samples.

    First, sample windows are generated on the set of time series samples according to the sliding window mechanism; second, each sample window is annotated as rugged or smooth based on whether its standard deviation is no less than a threshold; finally, a rugged (or smooth) segment is defined as a time series of consecutive rugged (or smooth) windows (see the sketch after this list).

  C.

    Data Post-processing

    The data post-processing module makes all the rugged segments generated in the previous module more complete and precise.

    First-stage post-processing: any smooth segment (between two rugged segments) whose corresponding samples number no more than a threshold is re-annotated as rugged. After that, all neighboring rugged segments are merged into one long rugged segment.

    Second-stage post-processing: any rugged segment (between two smooth segments) whose corresponding samples number no more than another threshold is re-annotated as smooth. After that, all neighboring smooth segments are merged into one long smooth segment.

  D.

    Feature Extraction

    Instances are generated by applying the sliding window mechanism again on each rugged segment, and the feature extraction module is executed on each sub-segment. Eight features (0.25-quantile, 0.5-quantile, 0.75-quantile, mean value, standard deviation, quadratic sum, zero-crossing, spectral peak) are extracted from a window of sample values in each axis (x-, y-, or z-axis of the accelerometer). In all, there are 24 features for each instance.

  E.

    Model Generation and Prediction

    All the instances are split into two sets (the training set and the testing set) of approximately the same size. Instances from the same segment are ensured to be put into the same set, since a water-usage activity should not be split apart.

    Support vector machine (SVM) is employed for model generation, and a Gaussian kernel can be utilized as its kernel function. Two parameters (the kernel parameter and the penalty parameter) need to be set before starting the learning process. In the end, a classifier is constructed on the training set.

    The classifier is then employed to predict the labels of instances in the testing set (testing instances). These prediction results are recognized as SVM’s prediction labels for the testing instances.

  F.

    Prediction Results’ Fusion

    The prediction results' fusion module follows the rule that 'the minority is subordinate to the majority'. Specifically, for each rugged segment, the dominant water-usage activity is the one corresponding to the most testing instances, and the prediction labels of all instances in the segment are replaced by this dominant activity. In this way, the prediction results of the rugged segment are fused.

    The nonintrusive, single-point infrastructure-mediated sensing approach in this chapter can recognize four classes of water-usage activities in daily life. Data is collected unobtrusively by a single low-cost 3-axis accelerometer attached to the surface of the main water pipe in the house, making the installation process much more convenient.
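The sketch below walks through a simplified version of the segmentation, feature extraction, and SVM steps described above; the window length, ruggedness threshold, synthetic vibration signals, and SVM parameters are illustrative assumptions rather than the settings used in [60].

```python
# Sketch of the segmentation, feature extraction, and SVM steps for the water-pipe
# accelerometer. Window length, thresholds, and the synthetic signal are assumptions.
import numpy as np
from sklearn.svm import SVC

WIN, STD_THRESH = 50, 0.05          # sliding window size and "rugged" threshold (assumed)

def annotate_windows(signal_3axis):
    """Slide a window over the 3-axis signal and mark each window rugged/smooth by std."""
    n_win = signal_3axis.shape[0] // WIN
    windows = signal_3axis[:n_win * WIN].reshape(n_win, WIN, 3)
    rugged = windows.std(axis=(1, 2)) >= STD_THRESH
    return windows, rugged

def extract_features(window):
    """Eight features per axis (24 in total) from one window of accelerometer samples."""
    feats = []
    for axis in range(3):
        v = window[:, axis]
        spectrum = np.abs(np.fft.rfft(v - v.mean()))
        feats += [np.quantile(v, 0.25), np.quantile(v, 0.5), np.quantile(v, 0.75),
                  v.mean(), v.std(), np.sum(v ** 2),
                  np.sum(np.diff(np.sign(v - v.mean())) != 0),   # zero-crossings
                  spectrum.argmax()]                              # spectral peak (bin index)
    return np.array(feats)

# toy usage: synthetic vibration data for two water-usage activities
rng = np.random.default_rng(4)
def fake_activity(freq, n=2000):
    t = np.arange(n) / 100.0
    base = 0.2 * np.sin(2 * np.pi * freq * t)[:, None]
    return base + 0.05 * rng.normal(size=(n, 3))

X, y = [], []
for label, freq in enumerate([3.0, 8.0]):
    windows, rugged = annotate_windows(fake_activity(freq))
    X += [extract_features(w) for w in windows[rugged]]
    y += [label] * int(rugged.sum())
X, y = np.array(X), np.array(y)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian kernel, penalty parameter C
clf.fit(X[::2], y[::2])                          # every other instance as the training set
print("test accuracy:", clf.score(X[1::2], y[1::2]))
```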

3.3 Non-contact Physiological Signal Sensing

Non-contact vital sign detection has received significant attention from healthcare researchers because it can acquire basic physiological signals without any interference to the user. Electrode-attached approaches, such as electrocardiography (ECG) or respiratory detection instruments, require the user to stay in a particular place or to wear the device all day long. These approaches have a negative impact on users' daily life and cannot be used in many applications, such as sleep apnea monitoring, vital sign monitoring for burn patients, and healthcare tasks that require long-term monitoring.

Heartbeat and respiration are common physiological signals, which can be used for sleep monitoring and abnormal body condition monitoring. At present, the traditional methods for heartbeat detection are electrocardiography (ECG) and photoplethysmography (PPG). The traditional detection method for breathing mainly measures the air volume and flow through the nose and mouth during the breathing process. All these methods require direct physical contact with the user, and the electrodes, sensors, or masks need to be placed close to the skin for physiological signal measurement. Although the measurement results are more accurate, such methods strongly interfere with the user's normal life, greatly reduce comfort, and cannot achieve long-term monitoring of the user's physiological information. Therefore, non-contact detection methods have attracted more interest recently.

The non-contact detection of heartbeat and respiration can be achieved by many methods, such as camera [61], radar, Wi-Fi, and ultrasonic [62] sensing. The camera method performs heartbeat detection from face video and respiration detection from body video. The other methods mainly detect heartbeat and respiration frequencies from the chest vibration caused by respiration and heartbeat. Among them, the radar method has a better recognition effect when the user is still, because electromagnetic waves can penetrate clothes or covers and most of their energy is reflected when they reach the surface of the human body.

The radar methods can be further subdivided according to the principle of signal transmission and reception. The most commonly used are Doppler radar and FMCW radar [63]. There are also many innovative radar methods used in heartbeat and respiration detection, such as UWB pulse radar [64], self-injection-locked radar [65], and UWB impulse radar [66].

The Doppler Radar: The Doppler radar method measures a user's chest movement via the phase of the returned signal. Doppler radar transmits a continuous wave (CW) electromagnetic signal toward the user's body, and the RF signal is reflected from the skin and tissue of the body. The receiver acquires the electromagnetic signal and mixes the received signal with the transmitted signal for vital sign detection.

Recently, a coherent receiver has been used, which mixes the received signal with a quadrature mixer: the original received signal and a 90-degree-shifted copy are both mixed with the transmitted signal to obtain two quadrature components. With this method, the NULL point of radar detection is avoided. The signal then needs to be demodulated with linear or nonlinear demodulation methods to obtain the phase change containing the chest displacement \(x\left( t \right)\). The heartbeat and the respiration signals can then be obtained with signal processing methods or machine learning methods.

The FMCW Radar: The frequency modulated continuous wave (FMCW) radar can determine the absolute distance between the system and a target. The FMCW radar transmits a variable-frequency signal whose modulation frequency slews up and down as a sine wave, sawtooth wave, triangle wave, or square wave [63]. For vital sign detection, if the target is a person, the received signal contains information about the chest movement, so the signal exhibits a frequency shift at the chest motion frequency. By detecting this frequency shift in the range information, the heartbeat and respiration frequencies can be calculated.
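As an illustration of the last step, the sketch below estimates respiration and heartbeat rates from a simulated chest-displacement signal by band-pass filtering and picking spectral peaks in the typical respiration and heartbeat bands; the sampling rate, band limits, and signal model are assumptions for demonstration only.

```python
# Illustrative post-processing sketch: once the radar signal has been demodulated into a
# chest-displacement signal x(t), respiration and heartbeat rates can be read off as
# spectral peaks in their typical frequency bands. All parameters are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0                                      # assumed sampling rate of the slow-time signal (Hz)
t = np.arange(0, 30, 1 / fs)                   # 30 s observation
resp, heart = 0.25, 1.2                        # "true" rates: 15 breaths/min, 72 beats/min
x = 4.0 * np.sin(2 * np.pi * resp * t) + 0.3 * np.sin(2 * np.pi * heart * t) \
    + 0.1 * np.random.randn(t.size)            # chest displacement (mm), respiration dominates

def band_peak(sig, lo, hi):
    """Band-pass the signal and return the frequency of its strongest spectral peak."""
    b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, sig)
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(filtered.size, 1 / fs)
    return freqs[spectrum.argmax()]

resp_hz = band_peak(x, 0.1, 0.5)               # typical respiration band
heart_hz = band_peak(x, 0.8, 2.0)              # typical heartbeat band
print(f"respiration ~ {resp_hz * 60:.1f} /min, heartbeat ~ {heart_hz * 60:.1f} /min")
```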

4 Conclusions

In this chapter, we have shown different ways to sense users' location information, activity information, and other context information with the pervasiveness of intelligent devices under the smart home infrastructure. In the future, with the development of the Internet of things (IoT), edge computing, and cloud computing, the sensing ability in smart homes will be unprecedentedly powerful, and the collaborative computing framework of the three (IoT, edge computing, and cloud computing) will be the trend, adaptively using devices and resources to optimally accomplish tasks. Moreover, with the maturity of pervasive sensing techniques, more convenience will be brought to people's daily life, making high-quality living possible.