1 Introduction

Owing to advances in internet-of-things (IoT) technologies and their infrastructures, smart home services are gaining momentum across several ICT industries. IoT-based smart home systems combine heterogeneous ubiquitous devices and appliances that are connected together to provide smart services to a home’s inhabitants [1]. For instance, such services can monitor power consumption, control smart appliances, or recognize residents’ activities to provide healthcare services or detect critical medical conditions. As homes become more intelligent, more complex and technology-dependent services emerge. Figure 1 shows an architectural design of a smart home system, in which multiple heterogeneous sensors are installed to recognize elderly residents’ activities.

Fig. 1 Smart home systems

The home area network (HAN) defines the connection topology, in which sensors report their readings back to a central unit such as a server or a cloud. Several HAN network technologies are used in existing smart homes, such as X10, ZigBee, and Z-Wave. X10 is a protocol that allows for remotely controlling appliances in smart homes [2]. It involves short-distance radio frequency that enables fast communication between transmitters and receivers. On the other hand, ZigBee and Z-Wave support a mesh network topology, in which there is no single path for a message to reach its designated destination [3, 4]. As technology converges and the cost of connectivity decreases dramatically, HANs are always connected to the internet via wireless mediums. However, such always-on connectivity raises cybersecurity threats, as HANs could be vulnerable to intruders’ attacks [5, 6].

Existing wireless network technologies are vulnerable to espionage and vandalism threats. Because wireless communication mediums are public, an intruder may exploit this vulnerability to recognize sensors’ identities [7] and, from them, the life patterns of a smart home’s residents. Consequently, this may result in a cybersecurity breach that affects individuals’ privacy or leads to cybercrimes. Specifically, intruders try to learn the behavior of inhabitants by identifying the functionalities of the sensors installed at home. This may lead to significant privacy issues that enable dangerous crimes, facilitate whaling (a phishing attack that targets high-profile individuals), or leak sensitive information about the residents. Sensor identity theft in smart homes is the process of identifying the types of the installed sensors, which results in discovering the events or activities performed by a home’s residents. Such knowledge allows intruders to commit fraud or other types of crimes. Furthermore, an intruder who knows sensors’ identities could compromise the integrity of the broadcast data.

Current literature has paid more attention to solutions that preserve people’s data privacy while ignoring the devices that people use in their daily activities in smart homes. Such ubiquitous equipment can easily reveal useful knowledge, which breaches people’s privacy as well. For instance, in [8] researchers investigated the preservation of users’ privacy in social big data. Other research proposed a novel technique for securing users’ privacy in smart mobile applications [9]. In [10], researchers utilized an authentication method to preserve the privacy of users who benefit from IoT services.

However, little interest has been shown in preserving the identities of sensors that are embedded in smart devices and appliances. Research in [11,12,13,14] has focused on several authentication methods to mitigate the risk of data breaches. While authentication and authorization paradigms have proven effective in preventing unauthorized access to network activities, bypassing such techniques becomes easier when intruders intercept wireless signals and sniff data packets from the open space rather than from edge nodes.

This paper introduces a framework for protecting sensors’ identities in smart homes using a novel data-driven technique. The proposed methodology defines the sensor identity problem in a smart home environment as a binary classification problem, which measures how likely an intruder is to predict the identities of smart home sensors. Our proposed approach relies on defining an extra data level, which partitions sensor readings into a set of scrambled signals that, in turn, can be aggregated into a meaningful and readable record at the destination. The proposed protocol preserves the identity of the data source while keeping data granularity at lower levels. In other words, it does not add extra load on the network medium.

Given a set of sensors that are connected via a HAN in a smart home, \(H = \left\{ {s_{1} ,s_{2} , \ldots ,s_{n} } \right\}\), and a set of readings produced by each type of these sensors, \(S_{k} = \left\{ {d_{k,1} ,d_{k,2} , \ldots ,d_{k,l} } \right\}\), the set of all possible data that might be generated from this network is defined by the function \(H \cdot S\). Furthermore, we define the function as follows:

$$f\left( {H \cdot S_{k} } \right) = H \times \bigcup\limits_{i = 1}^{N} {d_{k,i} }$$
(1)
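As a rough illustration of Eq. (1), the following sketch enumerates the data space as the Cartesian product of the sensor set and the union of one sensor type's readings. The sensor names and reading values are made-up placeholders, and treating each reading \(d_{k,i}\) as a small set is an assumption made only so that the union is well defined.

```python
from itertools import product

# Hypothetical sensor set H and readings of one sensor type S_k
# (illustrative values only, not taken from the datasets).
H = ["s1", "s2", "s3"]
S_k = [{"on"}, {"off"}, {"on"}]  # d_{k,1}, d_{k,2}, d_{k,3}, each treated as a set


def data_space(sensors, readings):
    """Return H x (union of d_{k,i}) as a set of (sensor, reading) pairs."""
    union = set().union(*readings)
    return set(product(sensors, union))


space = data_space(H, S_k)
print(len(space))  # 3 sensors x 2 distinct readings
```

Under these assumptions, an intruder who observes a reading has to decide which of the pairs in `space` produced it, which is the prediction task the remainder of the paper tries to make hard.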

Through this research, we will prove that \(f\left( {H \cdot S_{k} } \right)^{ - 1}\) is the inverse of the original function with extremely low probability of being detected or predicted by intruders. Moreover, we will test our proposed methodology using well-known and efficient binary classifiers to measure its performance in terms of misclassification rate, loss function, and the sensitivity of our proposed methodology to running time, detection rates, enhancements, and its effect on each classifier.

This paper is organized as follows: Sect. 2 discusses the related work in the literature and highlights the contribution of this research as compared to existing ones. Section 3 introduces and explains the proposed methodology. Section 4 illustrates the experiments and the anticipated results. Finally, Sect. 5 concludes the research work.

2 Literature Review

This section discusses the related research from three perspectives: the communications among smart home devices and appliances, the existing research on identity theft, and the application of data-driven solutions to the identity theft problem in smart homes.

Home area networks have been developed to define the operational connectivity among devices and appliances in smart homes. What distinguishes such interconnectivity is the need to connect several pieces of equipment over a fast transmission medium with a reliable, low-load, ad-hoc connection that allows for multiple heterogeneous nodes [15]. HANs have been successfully implemented to monitor the daily operations of home appliances, such as turning the lights on or off, controlling the home temperature, or providing a voice command system to monitor home appliances.

Recent advances in IoT technologies have added another level of complexity: the need to collect data and make decisions based on the behavioral patterns of home inhabitants. In other words, the operations of different appliances are no longer independent. In addition, homes are also connected to other homes, hospitals, schools, cars, and other data sources to form smart cities [16]. Such complexity extends the concept of the HAN to include extra services that cope with current technology and the emerging need for smart services.

Connectivity in an ad-hoc environment, in which new appliances can be added and their locations change over time, has been handled by embedding wireless sensors in appliances [17]. Consequently, the open spectrum medium of wireless sensor communications raises security and privacy concerns [18]. For this reason, communication protocols for HANs have been developed to provide the required functionality and specifications and to preserve the privacy of such networks.

ZigBee is a bidirectional radio frequency protocol that adopts the IEEE 802.15.4 wireless networking standard. ZigBee technology has been widely used in developing HANs as it provides low-data-rate communication and, consequently, long battery life [19]. Since ZigBee is built on top of wireless mediums, its architecture makes it subject to intrusion attacks, as all appliances are connected via a single coordinator (controller). In addition, connecting ZigBee networks to an external internet connection or Wi-Fi requires extra equipment and complexity, which makes ZigBee unsuitable for IoT connectivity standards.

On the other hand, Wi-Fi provides high transmission rates as compared to ZigBee, permitting appliances that require streaming, synchronous communications, and fast response to function in smart homes [20]. Wi-Fi is also a bidirectional radio frequency protocol; it implements the IEEE 802.11 standards. While Wi-Fi is considered one of the most reliable and trusted connectivity mediums, it is prone to interference due to the open spectrum environment of wireless communications.

To overcome intrusions, several studies have investigated the application of a secure layer on RFID (Radio Frequency Identification) technology [21], in which identification is performed through RFID tags. Unfortunately, it is hard and inefficient to replace sensors embedded in appliances with RFID tags to form an ad-hoc network. Furthermore, IoT connectivity seems impossible with the RFID protocol, which relies on the EPC (Electronic Product Code) protocol that was developed to track items rather than to facilitate communication among devices.

The process of an identity theft attack (ITA) is based on scanning networks to detect unsecured and weakly configured connections. Once one is detected, the attacker copies the identity and uses it to access private information. The purposes of attacks vary [22] and can be classified into the following: obstruction of data, countering international cybersecurity measures, retardation of decision making, denial of public services, abatement of public confidence, and other goals.

Our problem is complicated because it concerns securing the type of sensor (device), that is, protecting sensor identity, rather than securing the communicated data. Specifically, in smart homes, if the attacker detects the functionality of a sensor, it becomes easy to understand the behavior of the smart home’s resident [23]. Our goal is to protect the identity of the sensors rather than the data generated by them.

The literature offers several techniques for detecting such types of cybersecurity attacks. Unfortunately, most of these techniques are not applicable to securing sensor identity, which is the primary constituent of modern smart homes. For instance, the well-known solution that keeps track of neighbors and their locations so that the designated base station can regulate the communication [24] is not applicable in this domain. Such a traditional solution secures the communicated data rather than the identities of sensors [25].

TOR-based anonymous communication among smart home appliances has been proposed in [26], in which the authentication phase has been omitted. This approach is based on public-key cryptography, which is very expensive in terms of processing time and memory. Another interesting solution is the lightweight authentication sessions proposed in [27]. It relies on establishing a token-based protocol to legitimize the identity of a smart device, in which a centralized state table is kept to manage this process. The proposed solution guarantees communication security while ignoring anonymity.

Santoso and Vun [28] have proposed a solution more specific to IoT systems by considering the user convenience aspect. The proposed solution is based on establishing a shared key among different IoT sensors using the Elliptic Curve Diffie-Hellman (ECDH) primitive. In [29] and [30], multi-level and multi-tier schemes were implemented to expand the ECDH functionality. Neither ECDH nor its expanded versions guarantee that the type of sensor is kept secure from attackers.

The easiest way to detect the type of wireless sensors embedded in smart devices in modern smart homes is to analyze the data that these sensors generate [31]. For instance, a motion sensor on the restroom door can be interpreted as the inhabitant’s need to count the number of times this room is used. Another sensor measures the insulin level of a specific resident. The two facts are correlated, since insulin can be easily mapped to diabetes, of which frequent restroom visits are a major symptom. Therefore, the attacker can conclude that the resident has diabetes.

Data-layer technologies can provide simple and efficient solutions to handle the identity theft of sensors in smart homes. The ultimate goal of such solutions is to hide the semantics of the communicated data rather than their values [32, 33]. However, detecting the semantics of communicated data is not a simple task. Recent advances in machine learning techniques (such as deep learning) facilitate detecting behaviors through data history.

Another important direction is the adoption of time-sensitive networks [34], which can be utilized to centralize the control and management of traffic streams as a scheduling problem. Such a configuration enhancement would reduce the overall communication time and allow for enforcing security rules among all connected sensors. Although this research adopted a distributed algorithmic methodology, it could be upgraded in the future to a more centralized architecture.

Table 1 summarizes the related work and compares existing methods in terms of their contributions and research directions.

Table 1 Summary of related work

3 Sensors Identity Protection

This section introduces our methodology for protecting the identities of sensors that are installed in a home area network. The proposed methodology is driven by a verification phase: every phase was verified during its modeling against reliability issues such as concurrency. First, we describe the common communication model in the home area network, which clarifies the implementation environment. Next, we illustrate how the attacker can benefit from such an environment to identify the sensors and then learn some meaningful information. Finally, we provide algorithmic descriptions of our proposed technique that detail the execution of the different phases and the way we verify each of them.

3.1 Communication Model

Home Area Networks (HANs) are described as networks in which several smart home appliances are connected to communicate fine-tuned messages. While the communication topology and installation architecture are not a major contribution of this research, this section shows the basic components of HANs that affect our perspective on preserving the privacy of the smart sensors as part of such a network. Figure 2 shows the general architecture of a HAN, which consists of connected smart appliances and the gateway, through an internal modem, to the internet.

Fig. 2 Smart area network (HAN)

Consider a set of appliances in a smart home, where every appliance is equipped with a special-purpose sensor and maintains a look-up table. The set of sensors that are attached to home appliances is defined as \(D = \left\{ {d_{1} ,d_{2} , \ldots ,d_{n} } \right\}\). Once an appliance joins the network, the gateway controller assigns it a number that uniquely identifies the appliance in the network (network address). The set of identifiers is defined as follows: \(ID = \left\{ {id_{1} ,id_{2} , \ldots ,id_{n} } \right\}\). Furthermore, during the initialization phase, an appliance constructs a look-up table that maintains information about other home appliances, the types of messages, and the responding protocol.
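The joining step described above can be sketched as follows. This is only a minimal illustration: the `Gateway` and `Appliance` classes, their fields, and the appliance names are hypothetical, and the look-up table is reduced to a mapping from network address to appliance name.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Appliance:
    name: str
    net_id: int = -1                            # assigned by the gateway on join
    lookup: dict = field(default_factory=dict)  # net_id -> name of other appliances


class Gateway:
    """Hypothetical gateway controller that hands out unique network addresses."""

    def __init__(self):
        self._next_id = count(1)
        self.members = {}

    def join(self, appliance):
        appliance.net_id = next(self._next_id)
        # The newcomer learns about existing appliances, and they learn about it.
        for other in self.members.values():
            appliance.lookup[other.net_id] = other.name
            other.lookup[appliance.net_id] = appliance.name
        self.members[appliance.net_id] = appliance
        return appliance.net_id


gw = Gateway()
thermo = Appliance("thermostat")
ac = Appliance("air_conditioner")
gw.join(thermo)
gw.join(ac)
print(thermo.net_id, ac.net_id, ac.lookup)
```

A real look-up table would also record message types and the responding protocol; those fields are left out here to keep the sketch short.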

As shown in Fig. 2, two communication schemes apply to this network: internal and external. Internal communications follow the IoT standard communications, in which devices communicate directly (P2P). External communications, on the other hand, are performed via the HAN gateway to connect the home network with external ones (the Internet). Communication messages in this network follow a predefined protocol. A message carries information from the sender in a form that is well known to the receiver. For instance, a sensor broadcasts the temperature of the room. Once the air conditioner receives this message, it cools or heats the room, or even turns the power off. Accordingly, every device has its own reaction to messages broadcast in the smart home.

The semantics of the data attached to each communicated message depend entirely on the source sensor, i.e., the device ID. In other words, the receiver cannot interpret the message correctly and pick the appropriate reaction without knowing the ID. The traditional privacy-preserving model relies on encrypting the communicated messages using the well-known public-private key framework. Since every appliance has a public key and its own private one, it can easily identify the message content. Formally, given a message \(m_{ID}\) and public key \(C_{k}\), the sender applies \(Enc\left( {m_{ID} , C_{k} } \right)\) to encrypt the message, while the receiver applies \(Dec\left( {m_{ID} , C_{k} } \right)\) to reverse the encryption. Figure 3 simplifies the secured communication paradigm in HANs.

Fig. 3 Secured framework for communicated messages

3.2 Sensor Identity Attack (Use-Case)

Each home appliance has unique characteristics that lead to identifying its functionality. Data analysis and machine learning tools make it easy to analyze the patterns of communicated data and identify their semantics, which creates a vulnerability that attackers may exploit. Although encrypting the communicated messages helps preserve data privacy, it does not prevent an attacker from learning the sensors’ identities, as the source location of these sensors cannot be hidden. Sensor identity, in this context, is defined as the distinct job that a sensor performs when it is actively functioning. For instance, it is not hard for attackers to predict the identity of home appliances in the kitchen. Once the attacker knows the source of the messages, it is only a matter of time before every device and its associated sensors are identified. In this case, an intruder can collect useful information that is not directly related to the exchanged data: it is enough for the attacker to know that some appliances were turned on in the kitchen at a specific time to conclude that the resident is at home or that the other rooms are empty. Figure 4 explains how an intruder can attack the network through the wireless communication scheme.

Fig. 4 Identity attacks
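To make the use case concrete, the sketch below shows how little an intruder needs: only the time and source address of each sniffed packet, with the encrypted payload ignored entirely. The packet list, the address labels, and the "busiest hour" heuristic are all illustrative assumptions rather than part of the attack model in the cited work.

```python
from collections import Counter, defaultdict

# Hypothetical sniffed traffic: (hour of day, source address).
# Payloads are encrypted and never inspected.
sniffed = [
    (7, "id_3"), (7, "id_3"), (7, "id_5"),
    (13, "id_3"),
    (19, "id_3"), (19, "id_5"), (19, "id_5"),
]


def activity_profile(packets):
    """Count, for every source address, how many packets were seen per hour."""
    profile = defaultdict(Counter)
    for hour, source in packets:
        profile[source][hour] += 1
    return profile


profile = activity_profile(sniffed)
# The hour in which each source is most active hints at what the device does.
busiest = {src: hours.most_common(1)[0][0] for src, hours in profile.items()}
print(busiest)
```

Even this crude profile separates a device that is busiest in the morning from one that is busiest in the evening, which is why a stable source identity leaks behavior despite encryption.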

To formulate the use case, we look for a technique that frequently changes sensors’ identities over time while all other home appliances remain able to identify each other. In other words, we try to benefit from the fact that internal communications can be highly controlled and manipulated so that external intruders cannot understand their paradigms.

3.3 Identity Preserving Protocol

This section explains our proposed 3-phase technique for setting up the environment and harmonizing the implementation of an identity-preserving algorithm with the other communication components. The three phases are initialization, concealment, and communication. The following subsections provide a detailed algorithmic description of each phase.

3.3.1 Initialization Phase

This phase presents the initial actions that set up the sensors in the home area network. The process involves defining the essential parameters and actions that should be executed to deliver the required services. Algorithm 1 defines a queue that maintains information about every sensor in the network. This queue holds the actual information about every sensor and must be updated at regular intervals. Line 2 defines the private key of the current sensor as a function of the public key and certificate. It follows the traditional IEEE standards for defining public-private keys. Line 3 defines the time variable, which is associated with the current sensor and does not need to be synchronized with similar variables in the sensor network. For these lines (2, 3, and 4), the time complexity that is required to execute them is \(O\left( 1 \right)\).

Line 4 defines a join function that allows the sensor to join the home network. The joining process involves three actions. The first one is to set up networking addresses and a dedicated connection to the network gateway. The second action is to collect information about neighboring sensors and push them into the sensor’s queue. The third action is to establish a permanent control connection in which commands and acknowledgments are passing through. The time complexity to run these actions is \(O\left( N \right)\), where N is the total number of sensors in the home area network.

Algorithm 1

To protect sensor identity, the initialization algorithm uses a new random integer each time. Lines 5, 6, and 7 are responsible for picking a random integer that will be used later to encrypt the identity of the sensor. Consequently, this hides the pattern of the sensor’s activities. The time complexity required to execute these lines is \(O\left( 1 \right)\). The threshold variable defines the range of numbers that can be assigned to the time variable. A higher threshold value implies a lower chance of recognizing the identity of the sensor. However, the threshold also plays a significant role in communication performance, since the correlation between the threshold value and the communication performance is strongly negative. Line 8 broadcasts an acknowledgement packet confirming the newly arrived information to other neighboring sensors with time complexity \(O\left( 1 \right)\).
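The relation between the threshold and the chance of recognition can be illustrated with a small sketch. It assumes, purely for illustration, that the random integer is drawn uniformly from the inclusive range 0 to the threshold; the function names are hypothetical and the actual drawing rule of Algorithm 1 may differ.

```python
import random


def pick_mask(threshold, rng):
    """Draw the random integer used to hide the identity (uniform draw is an assumption)."""
    return rng.randint(0, threshold)  # both ends inclusive


def blind_guess_probability(threshold):
    """Chance that an intruder guesses the drawn integer in one attempt under a uniform draw."""
    return 1.0 / (threshold + 1)


rng = random.Random(42)  # seeded only to make the sketch repeatable
mask = pick_mask(9, rng)
print(mask, blind_guess_probability(9), blind_guess_probability(999))
```

Under this assumption, raising the threshold from 9 to 999 lowers the one-shot guessing chance from 0.1 to 0.001, at the price of the communication cost discussed above.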

3.3.2 Concealment Phase

During this phase, every sensor hides its functionality, broadcasts its information, and receives others’ information. This thread is executed concurrently and is strictly dependent on the environment settings. In Algorithm 2, Lines 1 to 3 are executed repeatedly to generate a random time value, encrypt the sensor’s identity, and broadcast information to other neighboring sensors.

The while-loop in this pseudocode is used to receive identity acknowledgements from other sensors, decrypt them, and update the sensor’s queue. Furthermore, the sensor acknowledges every packet to guarantee reliability. The time complexity (for N sensors) that is required to execute this thread is \(O\left( {N^{2} } \right)\), since the function ‘receive’ requires N steps and the function ‘push’ requires N steps as well.

Algorithm 2
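A simplified sketch of one concealment round is given below. XOR masking stands in for the encryption step, the packet layout is hypothetical, and the mask travels in the clear only to keep the example short; in the protocol itself the packet would be protected by the keys set up during initialization.

```python
import random


def conceal(sensor_id, threshold, rng):
    """Hide the identity behind a fresh random value (XOR is a stand-in for encryption)."""
    t = rng.randint(0, threshold)
    return {"masked_id": sensor_id ^ t, "t": t}


def receive_all(packets, queue):
    """Recover each neighbour's identity, refresh its queue entry, and acknowledge it."""
    acks = []
    for packet in packets:
        real_id = packet["masked_id"] ^ packet["t"]
        queue[real_id] = packet["masked_id"]  # remember the identity currently in use
        acks.append(real_id)
    return acks


rng = random.Random(0)  # seeded only to make the sketch repeatable
packets = [conceal(sid, 255, rng) for sid in (11, 12, 13)]
queue = {}
acks = receive_all(packets, queue)
print(acks, queue)
```

Calling `conceal` again for the same sensor generally yields a different `masked_id`, which is the property that prevents an outside observer from linking packets to one stable identity.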

3.3.3 Communication Phase

The main communication module defines how sensors execute all three phases sequentially and/or concurrently. It consists of three threads that are responsible for sensing, receiving, and synchronizing the maintenance of system queues. As shown in Algorithm 3, the module begins by initializing the environment and setting the time-threshold value, which is shared globally. The first thread is responsible for sensing the environment and reporting back the results in an encrypted message. The receiving thread is responsible for waiting until other neighboring sensors send their information. Finally, the synchronization thread maintains the system queues.

Algorithm 3

The concurrent implementation of these three threads raises the problem of interleaving, in which a thread may access a resource while it is being used by another one. Threads, in this context, are triggered according to the sensing environment. For instance, the sensors are set to read the environment at fixed intervals, and each sensor has its own setting. On the other hand, once a sensor reports a change of its identity, the others should synchronize their queues. Thus, interleaving among threads may occur often during the lifetime of the network.
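One common way to keep such interleaving safe is to guard the shared queue with a mutual-exclusion lock. The sketch below assumes a single dictionary as the shared queue and uses hypothetical thread bodies; it is not the paper's Algorithm 3, only an illustration of the three roles sharing one resource.

```python
import threading

queue_lock = threading.Lock()
shared_queue = {}  # sensor id -> latest entry (hypothetical layout)


def sensing(sensor_id, rounds):
    """Periodically record this sensor's own reading."""
    for r in range(rounds):
        with queue_lock:
            shared_queue[sensor_id] = ("reading", r)


def receiving(updates):
    """Store identity changes reported by neighbouring sensors."""
    for sid, masked in updates:
        with queue_lock:
            shared_queue[sid] = ("masked", masked)


def synchronizing(snapshot):
    """Copy a consistent view of the queue while no other thread is writing."""
    with queue_lock:
        snapshot.update(shared_queue)


workers = [
    threading.Thread(target=sensing, args=(1, 100)),
    threading.Thread(target=receiving, args=([(2, 17), (3, 42)],)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

snapshot = {}
sync = threading.Thread(target=synchronizing, args=(snapshot,))
sync.start()
sync.join()
print(snapshot)
```

The synchronization thread is started after the other two finish only so that the printed snapshot is deterministic; the lock is what makes it safe to run all three at once.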

3.4 Model Verification

Each phase of the proposed protocol is designed to be implemented as embedded code in every cooperative sensor. Therefore, sensors may run similar threads simultaneously, which raises the problem of how the main parameters can be synchronized. For instance, the queues’ values must be known to other sensors at the end of the sensing phase.

Since concurrency is a major issue in this technique, model verification is necessary for ensuring that the system handles interleaving among its running threads. For this reason, we use model checking to ensure that the proposed technique achieves the overall system specifications in terms of temporal aspects. As described in [35, 36], model checking is a tool that can detect whether or not a model conforms to a specific requirement specification. It cannot tell that the system is error free, but it can show with a counterexample that the system violates a given requirement.

To involve model checking in our design, we use UPPAAL [37] to formally describe the interactions among threads in our proposed technique. UPPAAL defines the system as a set of states and transitions. The tool runs all models concurrently and tests their verifiability in terms of a given set of temporal specifications. Figure 5 shows the finite-state automata for every thread.

Fig. 5 UPPAAL models of system threads

For this reason, we designed 10 specifications, shown in Table 2, to verify that our proposed technique maintains reliability in terms of concurrency, resource sharing, and safety. Our modeling of the whole technique was controlled by checking the temporal specifications of this system. Eventually, the proposed technique proved its ability to maintain concurrency during execution. The properties used are: Mutex, Deadlock, Starvation, Liveness, Activation, Reachability, and Safety.

Table 2 System temporal specifications

4 Experiments and Results

This section describes and explains a set of experiments that have been conducted to evaluate the performance of the proposed technique. There are mainly two types of experiments: performance evaluation and sensitivity analysis. The first experiment tests how our proposed technique raises the misclassification rates, based on the assumption that sensors’ identities are hard to identify if existing data features do not clearly point to them. The second experiment, on the other hand, shows how sensitive our proposed technique is to the threshold value range and queue size.

To perform such experiments, we applied our technique on four well-known datasets that have been collected from smart homes in four different countries obtained from CASAS project [38]: France-Tulum, Egypt-Cairo, Italy-Milan, and Japan-Kyoto. Every dataset comprises instances covering a finite set of activities. Actions are generated using motion, temperature, or detection sensors. Table 3 briefly describes the datasets. Note that every dataset has an attached map that shows the location of sensors, which can be interpreted as the location where a specific action triggers.

Table 3 Description of datasets

Instances in these datasets represent the daily activities of a single resident; they have been collected and labeled manually, so that experiments could be supervised using already annotated instances. Note, only active sensors have been used, because some sensors were not active in the testbed during the data collection process. Figure 6 shows the planted sensors in the Milano smart home.

Fig. 6 The distribution of the sensors in the smart home (Milano-Italy)

Moreover, we conducted a preprocessing task to assemble all actions that are related to each activity, since each activity consists of a set of actions. This preprocessing task is known as data segmentation, in which each segment is a record that aggregates information from multiple actions to represent a specific activity.
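A minimal sketch of such a segmentation step is shown below. The record layout (activity label, sensor identifier, value), the sample rows, and the aggregated fields are assumptions made for illustration; they are not the CASAS file format or the exact features used in the experiments.

```python
from itertools import groupby

# Hypothetical time-ordered, labeled actions: (activity label, sensor id, value).
actions = [
    ("cooking", "M01", "ON"),
    ("cooking", "T02", "31"),
    ("sleeping", "M07", "ON"),
    ("sleeping", "M07", "OFF"),
]


def segment(rows):
    """Merge each consecutive run of actions with the same label into one record."""
    segments = []
    for label, run in groupby(rows, key=lambda row: row[0]):
        run = list(run)
        segments.append({
            "activity": label,
            "sensors": sorted({row[1] for row in run}),
            "n_actions": len(run),
        })
    return segments


segments = segment(actions)
print(segments)
```

Because `groupby` only merges consecutive rows, two separate occurrences of the same activity stay as two segments, which matches the idea of one record per performed activity.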

Finally, all experiments were conducted using Python 2.2 on a machine with the following specifications: Intel(R) Core(TM) i5-4200M CPU @ 2.50 GHz, 4.00 GB physical RAM, Windows 7 Professional Edition. Moreover, data visualization and results analysis were developed using MATLAB.

4.1 Performance Analysis

In this section, we present our experiments to measure the performance of the proposed technique. We designed the experiment as a binary classification problem to measure the ability of different classifiers to identify the sensors’ identities (class labels). For this reason, we chose the misclassification rate (MC) as a measure of the classifiers’ inability to recognize sensors (i.e., a higher MC value implies a lower ability of a classifier to identify the sensor identity). The misclassification rate is defined as follows:

$$MC_{rate} = \left( {FP + FN} \right)/\left( {TP + FN + TN + FP} \right)$$
(2)

where \(TP\) is the true positive value, \(FN\) is the false negative value, \(FP\) is the false positive value, and \(TN\) is the true negative value. The misclassification rate shows how our approach can confuse these classifiers (intruders) in recognizing the sensors’ identities. First, we ran the classifiers on the original data and measured the misclassification rates. Next, we applied our technique to the original data at different threshold values and fed the classifiers with the new datasets to measure the misclassification rates after applying our proposed technique. To implement this experiment, we chose six classification algorithms: K-Nearest-Neighbor (KNN), Hidden Markov Model (HMM), Support-Vector Machine (SVM), Decision Tree based classifier (J48), Naïve Bayes (NB), and Conditional Random Field (CRF). These algorithms were widely applied in similar research such as [1, 5, 16, 17]. Moreover, we split every dataset into training and testing sets (2:1 ratio), both of which were fixed to prevent testing the classifiers on different testing sets.
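Equation (2) translates directly into code. The helper below is a small sketch; the function name and the confusion-matrix counts in the example are hypothetical and are not results from the experiments.

```python
def misclassification_rate(tp, fn, fp, tn):
    """Eq. (2): share of instances that the classifier labels incorrectly."""
    total = tp + fn + tn + fp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (fp + fn) / total


# Hypothetical counts: 15 of 100 instances are misclassified.
mc = misclassification_rate(tp=40, fn=10, fp=5, tn=45)
print(mc)
```

A value close to 0 means the classifier (the simulated intruder) recognizes sensors reliably, while larger values indicate that the sensors are better hidden.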

As shown in Table 4, the misclassification rates increased significantly as the threshold value increased. This indicates that at a high threshold time interval, the proposed technique was able to achieve high misclassification rates, which implies its ability to decrease the detection rates. In other words, an intruder who owns the features or the sensor data has a lower chance of recognizing the source identity.

Table 4 Misclassification Rates of Data Classifiers at different threshold values

We performed an additional experiment to ensure that the overall performance of the proposed technique satisfies our goal of minimizing the detection rate of the given classifiers. In this experiment, we applied the loss rate formula to provide an indicator of how complex it is to identify an entity using existing features. Indeed, the higher the loss rate, the better the protection of the class labels (sensor identity). The loss rate is defined as the square root of the sum of the squared TP complement and the squared FP value, as follows:

$$Loss = \sqrt {\left( {\left[ {1 - TP} \right]^{2} + \left[ {FP} \right]^{2} } \right)}$$
(3)

Note that the value of the loss rate is greater than or equal to zero; the maximum might exceed 1 as it depends on the TP value. The lower the TP value, the greater the chance that the loss rate exceeds 1.
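The behavior of Eq. (3) can be checked with a short sketch. It assumes that \(TP\) and \(FP\) are expressed as rates between 0 and 1, which the text does not state explicitly; under that assumption the loss is bounded above by \(\sqrt{2}\), reached when \(TP = 0\) and \(FP = 1\). The function name and the sample values are illustrative.

```python
import math


def loss_rate(tp_rate, fp_rate):
    """Eq. (3): distance of (tp_rate, fp_rate) from the ideal point TP = 1, FP = 0."""
    return math.hypot(1.0 - tp_rate, fp_rate)


perfect = loss_rate(1.0, 0.0)  # classifier never errs
typical = loss_rate(0.4, 0.8)  # low TP rate and high FP rate
worst = loss_rate(0.0, 1.0)    # largest value when both inputs are rates in [0, 1]
print(perfect, typical, worst)
```

The example with a true positive rate of 0.4 already gives a loss of about 1, which illustrates why a low TP value makes it likely for the loss rate to exceed 1.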

As shown in Table 5, for all classification algorithms and datasets, the loss rates increased significantly. This implies that applying the proposed technique significantly complicated the process of recognizing the class labels (sensors’ identities). On the other hand, we noticed that the loss rate is affected by the shape of the datasets; datasets with more activities and sensors achieve higher loss rates than the others.

Table 5 Average loss rates at different threshold values

Finally, we conducted statistical testing to measure the significance of the differences achieved after applying the proposed 3-phase algorithm. At the 99% confidence level, the differences (enhancement and loss rates) were statistically significant (\(p < 0.001\)).

4.2 Sensitivity Analysis

This section discusses how the threshold value affects the overall performance of the proposed technique in terms of time cost, detection rates, enhancements, and its effect on each classifier. The proposed technique uses the threshold parameter to expand the range of scrambled values that sensors can be assigned to encrypt their identities over a time interval ranging from zero to the threshold value. For this reason, we consider how this parameter affects the running time in milliseconds. As shown in Fig. 7, the time required to execute the proposed algorithms increases exponentially up to a certain threshold value and then stabilizes as the threshold value increases further. This implies that, at high threshold values, the time required to execute the 3-phase algorithm is linear.

Fig. 7 Sensitivity of time cost to threshold values

On the other hand, we noticed that low threshold values had little effect on the detection rates, but the effect grew significantly as the threshold value increased. Figure 8 shows that, at an average threshold value, the detection rate decreased significantly, making the protection against sensor identity theft efficient and effective.

Fig. 8 Sensitivity of detection cost to threshold values

In addition, we investigated the enhancement rates achieved by each classifier to observe their behavior at different threshold values. According to Fig. 9, all classifiers achieved enhancements as the threshold value increased. The SVM classifier achieved a higher enhancement at the high threshold value, while KNN achieved a higher enhancement at the lower threshold value.

Fig. 9 The effect of threshold value on the enhancement rates

Finally, we investigated how the dataset specifications (such as the number of activities, the number of sensors in the home area network, etc.) affect the performance of different classifiers at different threshold values. This experiment is important because it gives valuable knowledge on how to pick the best threshold value for a specific smart home setting.

As shown in Fig. 10, KNN tends to perform better when the dataset is smaller. The misclassification rate of the Kyoto dataset (smallest size) was the lowest among the datasets, while it was the highest for Tulum (larger size). On the other hand, HMM (Fig. 11) showed no clear pattern in how it performs with respect to the dataset specifications. We noticed that most datasets achieved similar MC rates at the high threshold value.

Fig. 10 KNN MC performance

Fig. 11 HMM MC performance

SVM (Fig. 12), which had the lowest performance in terms of MC rate, tends not to be affected by the specifications of the datasets. We believe that converting the features into vectors of frequencies during preprocessing was the reason behind these results. Figure 13 shows that J48 achieved high detection when the dataset is small, except in the case of the Kyoto dataset at a high threshold value.

Fig. 12 SVM MC performance

Fig. 13 J48 MC performance

Figure 14 shows the performance of the NB classifier, which appears to be affected by the dataset specifications for the largest datasets (Tulum and Milan), while the effect was the opposite for the smallest ones (Cairo and Kyoto). CRF (Fig. 15) shows no pattern in how the dataset specifications affect the classifier, since the performance fluctuated as the threshold values increased.

Fig. 14 NB MC performance

Fig. 15 CRF MC performance

Accordingly, the results indicate that there is not enough evidence to draw a conclusion about the effect of dataset specifications on the performance of the classifiers, except in a few cases that cannot be generalized. Therefore, the proposed technique was able to perform well in protecting sensors’ identities in smart homes regardless of how the collected features are processed.

5 Conclusion

This paper introduces a novel approach to protect the sensors’ identities of smart homes. The proposed model aims to increase the security level of smart home devices and appliances, which reduces the risk of identifying sensors’ functionalities. The proposed approach applies a three-phase technique that manages a synchronized space among connected sensors and prevents outsiders from identifying the sensors. Furthermore, the proposed solution preserves the linearity of the time required to manage the protection of the home network sensors’ identities. The empirical results showed significant performance enhancements in protecting sensors’ identities in smart home area networks. Additionally, the experiments highlighted the impact on model performance of the threshold value, which defines the time interval for each sensor.