1 Introduction

Over the past decade, the Internet of Things (IoT) has become part of daily life. It has redefined existing services and applications by combining technologies such as Wireless Sensor Networks (WSNs), Cloud computing, and Cyber-Physical Systems (CPSs) to create new opportunities for customers and end-users (Miorandi et al. 2012). Furthermore, Li et al. (2015) extend the IoT concept to autonomic computing (AC); this extension can cover the maintenance, tolerance, management, and security issues that arise from the massive number of connected things in IoT systems (Miorandi et al. 2012; Whitmore et al. 2015).

In an IoT system, it is very important to deliver high-level services in terms of quality of service (QoS) parameters, such as reliability, performance, and security, despite the high distribution and hyper-connectivity of things. Hence, service management is necessary to ensure QoS. It includes the detection of system misbehavior, suspected events, incomplete tasks (device failure), and denial of service attacks, as well as the diagnosis of the detected problem and the establishment of a recovery plan.

Furthermore, the challenge grows with the massive increase in distributed things, which makes it harder to meet requirements and to ensure quality levels. IoT devices will account for 72% of all objects connected to the Internet and 45% of Internet traffic by 2025. By the same year, the global business will rise by 60% from 58.1 billion dollars in 2013 to reach 100.8 billion dollars (Al-Fuqaha et al. 2015). On the security side, the IoT-security field consumed over 348 million dollars of the market in 2016; IoTSec (2019) mentions in its report that: “By 2020, over 25% of identifying enterprise attacks will involve IoT, though IoT will account for only 10% of Information Technology security budgets”.

Given these statistics, and in the presence of abnormal conditions or situations, autonomic behavior (adaptation and management) is necessary to keep systems functional (Whitmore et al. 2015). Hence, assigning autonomic roles to things gives them the ability to deal individually with unpredictable events and to heal from security issues. Moreover, it enables things to perform tasks collaboratively, especially if the system was developed collaboratively and has the planning capability to achieve an individual objective by discovering the available contextual services (neighbor-provided services) (Baitiche et al. 2018).

Additionally, detecting and repairing system components with centralized healing strategies is exhausting and expensive. Besides, it is still challenging to integrate traditional self-healing approaches without adapting them to the characteristics of IoT systems. Furthermore, the hybrid nature of IoT components designed as sub-systems requires a stronger healing ability to ensure quality factors such as security and reliability (Kühn et al. 2018).

Overall, self-healing approaches have been integrated into IoT systems along two dimensions: device healing (software or hardware elements) and network healing. Generally, the first dimension uses either run-time verification or formal specification to test whether the system satisfies the healing properties, or configuration-file monitoring to detect malicious code or faults in the system controller. However, this dimension depends on the software developed and the hardware chosen to build the final IoT product. On the other side, the approaches applied in the network-healing dimension are capable of diagnosing problems such as congestion, high traffic, or link failure, and of recovering from them despite the non-standardization of IoT connection protocols.

In this context, several contributions reviewed DDoS attacks and failures in IoT systems, with or without autonomic properties or Cloud technology. Fruhlinger (2018) includes the Cloud as an intermediate point for communication between IoT devices to protect them from each other. Other researchers deal with it as a network problem; Bajunaid (2015) controls the device data using the Cloud and AC in order to manage IoT devices against failure in the Amazon Web Services (AWS) Cloud platform. Besides, Cirani et al. (2018) and Kim et al. (2017) present the healing process as a resilience solution for IoT gateways. Additionally, several research contributions are elaborated only for the IoT device in a generic architecture, such as Kühn et al. (2018), Montoya et al. (2018), and WISeKey (2016); those solutions are inadequate for collaboration to heal IoT services under DDoS attacks or failure in a hybrid system.

This paper proposes a solution to the problem of service availability based on the integration of the self-healing and self-protection attributes. Specifically, it uses autonomic computing and the CloudIoT paradigm to deal with denied services in IoT devices. At run-time, the proposed protocol transparently relocates the services provided by vulnerable or failed devices to either another contextual agent or the \(S^2aaS\) model (in the Cloud platform based IoT), in order to ensure the availability of the provided service after discovering it.
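The relocation idea can be illustrated with a minimal Python sketch. It is not the protocol defined later in the paper: the `Agent` class, the `relocate_service` function, and the preference order (a healthy neighbour first, then the Cloud-side shadow) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Agent:
    """A service host: a neighbouring thing or a Cloud-side shadow (hypothetical model)."""
    name: str
    healthy: bool = True
    services: Set[str] = field(default_factory=set)


def relocate_service(service: str, failed: Agent,
                     neighbours: List[Agent], shadow: Agent) -> Optional[Agent]:
    """Move `service` away from a failed or attacked device.

    Assumed preference order: the first healthy contextual neighbour,
    otherwise the shadow hosted in the Cloud.
    """
    if service not in failed.services:
        return None  # the device does not provide this service
    failed.services.discard(service)
    for candidate in neighbours:
        if candidate.healthy:
            candidate.services.add(service)
            return candidate
    shadow.services.add(service)
    return shadow
```

In this sketch the consumer of the service never sees the failure: it only needs to learn which agent now hosts the service, which is the value returned by `relocate_service`.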

The remainder of this paper is structured as follows. Section 2 presents the primitive concepts of the IoT ecosystem and the CloudIoT paradigm, an overview of DDoS attacks, and the autonomic computing properties. Section 3 defines the targeted problem space. Section 4 provides details on the related works. Section 5 describes the proposed protocol that answers the research questions, and Sect. 6 presents its performance evaluation. Section 7 presents the implementation and experimental results of the proposed protocol. Section 8 discusses the proposed solution and compares it with the related works. Finally, Sect. 9 concludes this work.

2 Preliminary concepts

In this section, we provide the preliminary concepts of the defined problem and highlight those related to our contribution, motivating the elements selected for the proposed protocol.

2.1 IoT ecosystem

2.1.1 IoT definition and oriented visions

The definition of IoT technology has evolved over the years, according to different perspectives or needs. In the following, we cite two different definitions: one by Botta et al. (2016) and the other by Kranenburg (Ray 2018).

First, Botta et al. (2016) defined the IoT as: “a paradigm based on intelligent and self-configuring nodes (things) interconnected in dynamic and global network infrastructure”. Second, Kranenburg (Ray 2018) defined the IoT as: “a dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual ‘Things’ have identities, physical attributes, virtual personalities, use intelligent interfaces and are seamlessly integrated into the information network”.

As can be observed from the definitions above, things are the pivotal elements of the IoT system. Those things appear in the literature in many forms: devices, sensors, actuators (microcontrollers), Cloud services (see Sect. 2.2), and others. In addition, those things are accessible through the Internet; they generate various types of data, process the captured information about the environment, and perform tasks locally or remotely in back-ends (server, Cloud) according to predefined schemes, in order to deliver the required services through sequences of autonomous actions (Whitmore et al. 2015).

Regarding the oriented visions, Da Xu et al. (2014) presented three: the things-oriented vision, the Internet-oriented vision, and the semantic-oriented vision. The first two give things the ability to be uniquely identified or addressed over the Internet, in both the virtual and physical worlds. The last vision emerged over time to simplify the representation of changing system information and data storage.

Moreover, the literature distinguishes three types of IoT systems: physical, virtual, and hybrid. In the first type, physical systems are embedded as digital objects in the real world. In the second type, virtual systems use virtual machine technologies in the Cloud (Virtual Objects, VOs) to virtualize things over the Internet for controlling and adapting IoT solutions. In the last type, the hybrid system, the hybridization appears as minimum, full, or partial integration with the Cloud (see Sect. 3). This type groups both physical and virtual IoT systems to bridge the gap between the real and the virtual world, and to guarantee the operation of the IoT system in a dynamic environment (Whitmore et al. 2015).

2.1.2 IoT-architectures: an overview

A deep investigation of existing IoT architectures shows three major layers: physical, network, and application layers (Da Xu et al. 2014; Al-Fuqaha et al. 2015; Whitmore et al. 2015; Li et al. 2015; Zaslavsky et al. 2013; Atzori et al. 2010). All the defined IoT architectures were inspired by the TCP/IP communication model, and each proposed architecture was reformulated according to system needs and objectives.

In the following, we explain the differences between IoT architectures (see Fig. 1); in addition, for each one, we give its limitations and its advantages in order to justify the selected IoT architecture for our contribution:

Fig. 1 The TCP/IP model vs. the IoT architecture from Al-Fuqaha et al. (2015)

  • Three-layer architecture This architecture was designed for WSN systems, which have a high level of homogeneity in sensor hardware and network communication (Da Xu et al. 2014). Hence, this architecture does not support the flexibility and the dynamicity of IoT systems.

  • Middleware based architecture This architecture allows the users to define the tasks and to deploy new services remotely, during run-time, to facilitate the integration of IoT services in order to satisfy application requirements (Atzori et al. 2010).

  • Service-Oriented Architecture (SOA) This architecture deals with the extensibility, scalability, modularity, and interoperability problems among heterogeneous IoT devices. It is based on a service layer that contains the following elements: service division, service integration, and service composition. Cloud technologies are needed to cover the network, service, and application layers in order to perform discovery, manage the interaction between things, and compose the service required by users. However, this architecture lacks device-abstraction functionalities and communication capabilities between devices (Atzori et al. 2010).

  • Five-layer architecture This architecture offers several mechanisms for controlling and accessing data for the application layer. Thus, it is the adequate architecture for the IoT system, although it is still under enhancement.

It is important to mention that Atzori et al. (2010) proposed a different SOA-based architecture for IoT middleware. The authors divide the service layer into two sub-layers: service management and service composition. The service-management layer is used for dynamic object discovery, status monitoring, and service configuration in order to control the QoS and context policies. The service-composition layer, on the other hand, is used to build specific applications by composing the actual run-time services in the largest possible scope. Benayache et al. (2019) later enhanced this architecture using the microservice concept to support dynamic service interaction.

In the following, we present the common functionalities of the IoT-architecture layers in terms of levels. The levels are defined according to the assigned tasks and the inputs/outputs of each layer (see Fig. 1):

  • Level 01 This level contains highly heterogeneous hardware components, which are able to sense the environment, collect its data, and process them according to predefined tasks.

  • Level 02 This level may include either a simple layer, the network layer (backbone network layer), or a composite layer (object abstraction layer). The first type transmits data and may integrate Cloud technology (see Sect. 2.2) in the hybrid system. The second type is divided into two sub-layers: interface and communication. The interface sub-layer reduces the complexity of communication between devices using a web-service interface. The communication sub-layer translates incoming messages into several pre-ordered tasks while hiding device heterogeneity.

  • Level 03 This level includes different layers with similar functionality, the service layer and the middleware sub-layer, both of which deliver the required services over network protocols. In addition, the middleware sub-layer integrates the communicating devices to deliver the required services and to optimize their energy. Besides, the coordination sub-layer provides storage and data processing, and guarantees the communication between devices that share the same objective in order to deliver a global service or to coordinate between services to make specific decisions.

  • Level 04 This level uses the application layer to visualize the collected data and to offer services through the user-interface to meet consumers’ requirements.

  • Level 05 This level includes only the business layer; it manages the received data using business models or graphs to support the future decisions made by end-users. However, the assurance of end-users' information privacy still needs enhancement.
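The role of the Level 02 communication sub-layer, translating incoming messages into uniform tasks while hiding device heterogeneity, can be sketched as follows. The two vendor payload formats, the field names, and the `translate` function are hypothetical and serve only to illustrate the adapter idea.

```python
from typing import Callable, Dict


def parse_vendor_a(raw: dict) -> dict:
    # Hypothetical vendor A: temperature already in degrees Celsius.
    return {"device": raw["id"], "task": "report", "value": raw["temp_c"]}


def parse_vendor_b(raw: dict) -> dict:
    # Hypothetical vendor B: Fahrenheit, converted so upper levels see one unit.
    celsius = round((raw["tempF"] - 32) * 5 / 9, 1)
    return {"device": raw["dev"], "task": "report", "value": celsius}


# One adapter per device family; upper levels never see the raw formats.
ADAPTERS: Dict[str, Callable[[dict], dict]] = {
    "A": parse_vendor_a,
    "B": parse_vendor_b,
}


def translate(vendor: str, raw: dict) -> dict:
    """Turn a vendor-specific message into a uniform task description."""
    if vendor not in ADAPTERS:
        raise ValueError(f"no adapter registered for vendor {vendor!r}")
    return ADAPTERS[vendor](raw)
```

Supporting a new device family in this sketch means registering one more adapter, while the service and application levels keep consuming the same task format.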

2.2 CloudIoT paradigm

2.2.1 Paradigm views and models

The integration of IoT and Cloud technologies created the CloudIoT paradigm. This paradigm changes the classical financial aspect of service delivery from the concept of “pay-as-you-go” to the concept of “pay only for what you use” (Botta et al. 2016). In this paradigm, a novel service model, X as a Service (XaaS), is created (Perera 2017; Cavalcante et al. 2016; Botta et al. 2014, 2016). This model provides new types of services, presented in Fig. 2 and explained in the following:

  1. Network as a Service (NaaS) model offers the visualization of devices in WSN.

  2. Sensing as a Service (\(S^{2}aaS\)) model presents ubiquitous access to sensors and sensor data (see the next section).

  3. Sensing and Actuation as a Service (SAaaS) model enables the automatic control of devices.

  4. Sensor Event as a Service (SEaaS) model is used to dispatch the sensor messages.

  5. Sensor as a Service (SeaaS) model provides ubiquitous management of remote sensors.

  6. DataBase as a Service (DBaaS) model provides ubiquitous management of the database.

  7. Data as a Service (DaaS) model offers ubiquitous access to different types of data.

  8. Ethernet as a Service (EaaS) model supports the connectivity to remote devices.

  9. Identity and Policy Management as a Service (IPMaaS) model manages the ubiquitous access to things.

  10. Video Surveillance as a Service (VSaaS) model supports the ubiquitous access to recorded video.

Fig. 2 The XaaS model in cloud architecture

2.2.2 Sensing as a service model

On the basis of the XaaS model, the \(S^2aaS\) model was introduced as a solution for IoT-infrastructure-based smart cities (Zaslavsky et al. 2013). Furthermore, it adds more flexibility, scalability, elasticity, and reliability, and offers maintenance strategies and disaster management in the Platform as a Service (PaaS) layer (see Fig. 2). Indeed, the sensors in this model are the basic data generators, and they can be present physically or virtually (Perera 2017).

It should be noted that this model has four basic elements: (1) Sensor data owners, (2) Sensor data publishers, (3) Extended service providers, and (4) Sensor data consumers (Perera 2017). The extended-service providers give the ability to share the IoT Infrastructure, collect unavailable data (passed by two others), make decisions/policies on the fly (at run-time), and minimize the cost using collective cooperation in order to accomplish the system objective.

The \(S^2aaS\) model implicitly includes the SeaaS model to fulfill the SOA architecture for IoT middleware by encapsulating the physical and the virtual sensors in it (Zaslavsky et al. 2013). In the SeaaS model, sensor management is the main element rather than data collecting and sharing. In order to support service discovery and composition in SOA, it is divided into three layers: (1) real-world-access layer (hardware communication), (2) semantic-overlay layer (for configuration), and (3) services-virtualization layer (Li et al. 2016; Zaslavsky et al. 2013).

2.2.3 Cloud-based IoT architectures

In the following, we present the classification of Roman et al. (2013) for IoT architectures under the CloudIoT paradigm. The classification includes four categories: Centralized IoT, Distributed IoT, Collaborative IoT, and Connected Intranets of Things (see Fig. 3):

  1. Centralized IoT (the most used approach) It includes central entities that could be Cloud platforms or edge computers. The things in this approach are passive in the management aspect and can only produce data, which is consumed at the central entity and delivered to customers.

  2. Distributed IoT This approach provides local and collaborative services with the possibility of integrating Cloud services; it appears to the users as a single coherent system. Hence, things shift from isolated entities (accessed only through the Cloud) to an interconnected system (accessible through the Cloud, IoT applications, and other things). This approach presents several advantages: it brings more openness to the IoT system (multiple APIs); it adds more viability on the financial and business side; and it ensures reliability (availability and fault tolerance) to retrieve partially missed information and to tolerate abnormal situations (e.g., resilient solutions), among other advantages.

  3. Collaborative IoT The collaborative process in this approach relies on the distributed approach to provide an easy dynamic process between various platforms. This approach helps to create a distributed environment in which a device can be discovered and visible in more than one space (platform collaboration) in order to generate new services or enrich the existing ones. Yet, the authors mention the lack of demonstration of the collaboration process in the architectures developed with this approach.

  4. Connected Intranets of Things This approach offers local processing of information and provides open access to users. However, the authors highlight the poor system management of this architecture when facing abnormal situations such as device failure, and the absence of discovery mechanisms or ontology approaches.

Fig. 3 The categories of IoT architectures under the CloudIoT paradigm from Roman et al. (2013)

2.2.4 Related-IoT-Cloud platforms

Several types of Cloud platforms have been developed to support the IoT systems. The IoT-Cloud platforms provide a high-level of security, mobility, and scalability for IoT solutions, by storing, processing, analyzing the device data, managing the devices-identity, and offering a device discovery over the network (through the Cloud platform). In fact, the development process of those platforms is done in two ways: (a) specific, according to the companies’ requirements, or (b) open access, which allows users to publish easily their own IoT solutions.

In the literature review by Ray (2016), the author classifies the existing platforms according to the following application domains: application deployment, device management, system management, heterogeneity management, data management, analytic deployment management, monitoring management, visualization, and research.

The device management is the selected application domain, in the CloudIoT paradigm, for our context. According to Weber (2019), the device management application domain integrates the following categories:

  • Provisioning (registration) and authentication.

  • Configuration and control (applied by end-user).

  • Monitoring and diagnostics (local failures and bugs).

  • Software maintenance and update (updates applied on a schedule or as a maintenance action).

Furthermore, there are two different classes of IoT platforms based on the application domains of Ray (2016): IoT middleware platforms and generic IoT platforms. The first class uses middleware technologies to manage the issues of the selected application domain in the Cloud. The second class, however, is used to handle data exchange between IoT applications, process data, integrate device data, and control the device from the Cloud.

It is worth mentioning that, in the literature, generic or middleware-based platforms such as Kaa (https://www.kaaproject.org/), OpenIoT (http://www.openiot.eu/), Openremote (http://www.openremote.com/), AWS (https://aws.amazon.com/fr/), ThingsBoard (https://thingsboard.io/), and others centrally manage IoT devices from the Cloud with minimal awareness of the dynamic context of distributed IoT devices and without considering the interaction between them, which makes these solutions more virtual than physical (Ray 2016). Moreover, healing mechanisms are integrated into those platforms to restore their functionalities in failure situations and to ensure availability based on spatial redundancy or on fault-tolerance (replacement/replication) concepts. Thus, device management consists of registering, controlling, and monitoring the device data through the Cloud.

Although integrating self-management behavior into things as active agents reduces the cost of using a Cloud-based IoT system, the platforms ignore this integration. Besides, this behavior could be applied on both sides (things and Cloud) by associating with each device (thing) a responsible element in the platform, which gives the Cloud more context-awareness. For this reason, we select in the following two platforms, one from each class, that apply potential association elements:

  1. OpenIoT The OpenIoT platform is categorized as an IoT-middleware platform that includes the \(S^2aaS\) model and delivers on-demand IoT services. The OpenIoT-middleware infrastructure supports the deployment of IoT sensors, protocol solutions, and the management of business/application events. Furthermore, the resilience mechanisms designed for IoT services (protection and recovery) are integrated into the Cloud as a middleware layer (Abreu et al. 2017) and applied by the device virtual manager (virtual solution) in order to manage the failure of the virtual device in the Cloud infrastructure instead of the IoT infrastructure.

  2. AWS The AWS platform is classified as a generic IoT platform that allows the device data to be modified in offline mode as a management solution in the Cloud. The modification is uploaded as soon as the thing comes online. This platform allows communication between clients (applications) and the required physical devices through the Cloud platform. It offers a digital representation of IoT devices (the shadow), which is not similar to the representation of the thing as a VO (Alshehri et al. 2018); however, two limitations should be noted:

    • The lack of a dynamic communication process between things (shadows); communication is ensured only by the presence of an access file between them.

    • A thing's shadow saves only the reported and the desired data of its own device, as a low level of tasks.
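The reported/desired mechanism described for the shadow can be sketched in a few lines of Python. This is a simplified model written for illustration and is not the AWS API: the `Shadow` and `Device` classes and their methods are assumptions of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Shadow:
    """Cloud-side record of a device: what is wanted vs. what the device last reported."""
    desired: dict = field(default_factory=dict)
    reported: dict = field(default_factory=dict)

    def delta(self) -> dict:
        # Settings whose desired value differs from the last reported value.
        return {k: v for k, v in self.desired.items()
                if k not in self.reported or self.reported[k] != v}


class Device:
    def __init__(self) -> None:
        self.state: dict = {}
        self.online = False

    def sync(self, shadow: Shadow) -> bool:
        """Apply pending changes once the device is online, then report back."""
        if not self.online:
            return False  # changes stay queued in the shadow
        self.state.update(shadow.delta())
        shadow.reported = dict(self.state)
        return True
```

A client can therefore write to `desired` while the device is offline; the change is applied at the next successful `sync`, which matches the offline-modification behavior described above. The sketch also shows the limitation noted in the text: the shadow holds only data, not the services of the device.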

For our contribution, we use the term shadow for a virtual representation that offers the same services as the physical device, with the abstraction of its data, services, and device-control system. Our definition of the term shadow is similar to the virtualization of the device in the \(S^2aaS\) model in the PaaS Cloud layer (the shadow agent, see Sect. 5).

2.3 Denial of service attacks

2.3.1 Attack overviews

Botnet builders invest in the security field to minimize the risk of controlling IoT devices maliciously; in fact, if the devices were not built to be controlled remotely or are located in inaccessible places, the healing process after the attack becomes more complicated and their vulnerability increases (Bajunaid 2015; Roman et al. 2011).

Besides, different threats could be detected in the (I)IoT system. Rubio et al. (2019) address availability, integrity, confidentiality, and authentication threats as the cybersecurity threats in this field. In most cases, device vulnerability is a reflection of the availability threats, which can materialize as either denial of service (DoS) attacks or distributed denial of service (DDoS) attacks (Ferdous et al. 2016).

The difference between them is the attack source: if the attack comes from many sources, it is a DDoS attack; otherwise, it is a DoS attack. In both attacks, however, the control is done by one master workstation. The consequences of those attacks on system performance are catastrophic (they cause resource depletion and subsequent performance degradation) (Carl et al. 2006). An example is the Mirai toolkit, or Mirai botnet attack, that happened on 4th November 2016: 100,000 IoT-device victims were enslaved using video recorders, surveillance cameras, and set-top boxes (IoTSec 2019).
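The distinction between the two attack types can be expressed as a small Python sketch. The function name, the per-window packet list, and the single rate threshold are illustrative assumptions; a real detector would need far more context, since legitimate traffic also comes from many sources.

```python
from collections import Counter
from typing import Iterable


def classify_flood(sources: Iterable[str], rate_threshold: int) -> str:
    """Label one observation window by traffic volume and number of senders.

    `sources` holds the source address of every packet seen in the window.
    """
    per_source = Counter(sources)
    total = sum(per_source.values())
    if total <= rate_threshold:
        return "normal"
    # Above the threshold: one sender means DoS, several senders mean DDoS.
    return "DDoS" if len(per_source) > 1 else "DoS"
```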

2.3.2 Attack mechanisms

Overall, the illegitimate activities that can be executed on the victim IoT device (thing) and that threaten its availability are either the disruption of its communication or data traffic, or the targeting of its equipment (Rubio et al. 2019). In such cases, the device is out of the legitimate user's control and is re-configured to make it unavailable. This may be realized through the static routing table, password files, security policies, or authentication (flooding attacks) (Carl et al. 2006; Liang et al. 2016; Djenna and Saïdouni 2018), or more precisely through (WISeKey 2016):

  1. Door opening Devices that use the Telnet protocol for communication become faulty, which opens the door to malicious activities.

  2. Get into the house The installation of malfunctioning tools (application/package) leads to a botnet attack that takes control of the device when it reconnects to the Internet.

  3. No bandwidth limitation The victim node suffers from a high level of traffic (overloaded bandwidth).

Therefore, the strategies applied to maximize device availability are healing mechanisms based on assurance testing (detection methods); protection of the device through negligent patching, i.e., modifying the device authentication (e.g., password) to block illegitimate activities, rebooting, or deleting the attack memory (e.g., Mirai memory); or both, using the defense mechanisms.

2.3.3 Defense mechanisms

In the following, we define and illustrate the basic roles of each defense mechanism reported in Kaur et al. (2017):

  1. Attack prevention Attempting to stop attacks before they occur by examining protocol weaknesses, inadequate identification access, and unprotected computers/OS.

  2. Track back Hindering malicious attacks by finding their original source.

  3. Attack detection Using anomaly-based technologies to observe the regular functions and to discover oddities in the incoming flux (such as traffic).

  4. Attack reaction Minimizing the cost and the resource loss caused by the (D)DoS attack while it is underway, and maximizing the QoS parameters.
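As an illustration of anomaly-based attack detection, the following minimal sketch flags a traffic sample that lies far above a baseline of normal observations. The statistical rule (mean plus `k` standard deviations), the default `k = 3.0`, and the function name are assumptions chosen for the example, not the method of Kaur et al. (2017).

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], observed: float, k: float = 3.0) -> bool:
    """Return True when `observed` exceeds the baseline mean by more than k sigma.

    `baseline` must contain at least two samples of normal traffic
    (for example, requests per second).
    """
    threshold = mean(baseline) + k * stdev(baseline)
    return observed > threshold
```

A detection step of this kind only raises the alarm; the attack-reaction mechanism is what then limits the damage while the attack is underway.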

The appropriate defense mechanism for our contribution, in the case of DDoS attacks, is the attack reaction. Thus, this mechanism helps things to protect their services and heal from a degraded state to ensure service availability. Consequently, it needs collaboration between the protection and the healing mechanisms in the hybrid-IoT system using the partial integration of the Cloud technology. In the sequel, we will illustrate the basic elements of healing and protection processes based on autonomic computing (AC).

2.4 Autonomic computing

A brief presentation of autonomic computing and the existing autonomic-control loops is given in this section.

2.4.1 Time-line review

It is important to mention that the proposed protocol is based on the concept of diagnosable systems; it also uses autonomic behaviors as the main process to ensure the availability of IoT services. The English-dictionary definition underlying the term diagnosable is: “to distinguish or identify (a disease, for example) by diagnosis” (FarlexInc 2016).

The concept appeared for the first time in the 1960s, in Preparata et al. (1967), as a mechanism for treating multiple-fault problems in a system composed of a number of units. Each unit of the system tests other units to diagnose fault patterns; the sequence of diagnosis events creates diagnostic elements. Hence, any system unit can launch the diagnosis process to maintain the system state using this attribute.
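The mutual-testing idea can be made concrete with a brute-force sketch. It assumes the usual reading of this diagnosis model, namely that a fault-free tester reports correctly while the report of a faulty tester is unreliable, and that at most `t` units are faulty. The function names and the dictionary encoding of test outcomes are illustrative.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

# (tester, tested) -> 0 if the tested unit passed, 1 if it failed
Syndrome = Dict[Tuple[int, int], int]


def consistent(faulty: Set[int], syndrome: Syndrome) -> bool:
    """Check whether a hypothesised fault set can explain every test outcome."""
    for (tester, tested), outcome in syndrome.items():
        if tester in faulty:
            continue  # assumption: a faulty tester may report anything
        expected = 1 if tested in faulty else 0
        if outcome != expected:
            return False
    return True


def diagnose(units: Sequence[int], syndrome: Syndrome, t: int) -> List[Set[int]]:
    """Return every fault set of size at most t that is consistent with the syndrome."""
    return [set(c)
            for size in range(t + 1)
            for c in combinations(units, size)
            if consistent(set(c), syndrome)]
```

For example, with three units testing each other in a ring (0 tests 1, 1 tests 2, 2 tests 0) and `t = 1`, the outcomes `{(0, 1): 1, (1, 2): 0, (2, 0): 0}` are explained only by the fault set `{1}`. The exhaustive search is exponential in `t` and is meant to clarify the definition, not to be used at scale.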

Over time, diagnosability was used to solve the diagnosis problem according to system typologies, in order to detect the faulty units or the fault-free units (Friedman and Simoncini 1980; Das et al. 1994). Additionally, it is used as an essential attribute of autonomic behavior to heal the state of the system under abnormal situations.

Back in 2001, IBM introduced the concept of autonomic behavior, which allows the system to manage itself in critical situations and in a changing environment using four control loops: self-configuration, self-optimization, self-protection, and self-healing (Kephart and Chess 2003; Computing 2006; Sterritt 2005; Tosi 2004). In addition, this behavior completes its knowledge of its internal state and of its external situation using, respectively, the self-awareness and context-awareness elements in order to accomplish a proper modification (Sterritt 2005).

Likewise, Salehie and Tahvildari (2009) introduced another concept of autonomic behavior as an adaptive behavior that makes decisions at run-time. The two visions apply autonomic behavior differently. The self-management system uses autonomic behavior in a single system-level view to reduce maintenance complexity. In contrast, the self-adaptive system uses it in several system-level views; if the behavior is applied at a single system level, it must also be applied at the other levels of the system (Angarita 2015).

Based on the multilevel approach introduced by Parhami (2015) (see Fig. 4), the system could be: (a) Defective, if physical imperfections are observed in components; (b) Faulty, if a logical deviation is present in system circuits; (c) Erroneous, if informational distortions have produced a change at the state level; (d) Malfunctioning, if an anomaly is detected at the structure level (architectural anomalies); (e) Degraded, if the system behavior lapses at the service level; and (f) Failed, if the results have computational breaches.

Fig. 4 Different levels of the single system from Parhami (2015)
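The six states listed above form an ordered scale, which can be encoded as follows. The numeric values and the `worst` helper are assumptions introduced for illustration; only the state names and their order come from the text.

```python
from enum import IntEnum
from typing import Iterable, Optional


class SystemState(IntEnum):
    """Impairment states in the order (a) to (f) given above."""
    DEFECTIVE = 1       # physical imperfections in components
    FAULTY = 2          # logical deviation in circuits
    ERRONEOUS = 3       # distorted information at the state level
    MALFUNCTIONING = 4  # anomaly at the structure (architecture) level
    DEGRADED = 5        # behavioral lapse at the service level
    FAILED = 6          # computational breach in the results


def worst(states: Iterable[SystemState]) -> Optional[SystemState]:
    """Return the most severe state observed, or None if nothing was observed."""
    return max(states, default=None)
```

With this encoding, a monitor that works at the service level only needs to react to states at or above `SystemState.DEGRADED`.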

It should be noted that our contribution targets the highest levels of the system view, particularly the service level, by monitoring the system behavior to detect either misbehavior or an unaccomplished service, under a collaboration state, in the second cycle of a single device (system).

2.4.2 Self-healing control loops

To maximize the availability of IoT services under DDoS attacks or device failure, self-protection and self-healing are the appropriate control loops. In the literature, however, we found only the definition of the system states under the self-healing loop, introduced by Ghosh et al. (2007) (see Fig. 5). Thus, in the following, we describe the attributes of this loop for both types of autonomic behavior.

Fig. 5

System states under the self-healing loop from Ghosh et al. (2007)

In both self-management and self-adaptive systems, the self-healing loop gives the ability to face unpredictable events in order to deliver a high level of service; however, these systems view the elements of this loop differently. The self-adaptive system defines it as self-diagnosing and self-repairing in malfunction situations, which targets the middle level of the cause-effect diagram in the six-level view (Angarita 2015).

Likewise, self-management systems use the self-healing loop to maximize the availability, maintainability, reliability, and survivability of the system. The loop includes three main attributes: detection, diagnostic, and recovery (see Fig. 6), which detect abnormal situations, diagnose the root cause and elaborate the recovery plan, and execute that plan without side effects (Angarita 2015).

Fig. 6

The self-healing attributes from Angarita (2015)
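The three attributes above can be read as one pass of a loop. The following is a minimal sketch under stated assumptions, not the implementation of any cited work: the names `HealingReport` and `self_healing_cycle`, and the idea of passing each attribute as a callable, are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealingReport:
    """Outcome of one pass through the loop (hypothetical structure)."""
    detected: bool
    root_cause: Optional[str]
    recovered: bool


def self_healing_cycle(
    detect: Callable[[], bool],
    diagnose: Callable[[], Optional[str]],
    recover: Callable[[str], bool],
) -> HealingReport:
    """Run one detection -> diagnostic -> recovery pass."""
    if not detect():
        # Nothing abnormal: neither diagnostic nor recovery is attempted.
        return HealingReport(detected=False, root_cause=None, recovered=False)
    cause = diagnose()
    if cause is None:
        # Abnormal situation without an identified root cause: no plan to run.
        return HealingReport(detected=True, root_cause=None, recovered=False)
    # Execute the recovery plan chosen for the diagnosed cause.
    return HealingReport(detected=True, root_cause=cause, recovered=recover(cause))


report = self_healing_cycle(
    detect=lambda: True,
    diagnose=lambda: "provider timeout",
    recover=lambda cause: cause == "provider timeout",
)
print(report)
```

Keeping the three attributes as separate callables mirrors the separation shown in Fig. 6; a real system would also share a knowledge base between them, which this sketch omits.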

3 Problem space

First and foremost, the Cloud, IoT, and AC domains have been introduced and used independently in the literature for many applications. Several studies have integrated them in different ways to ensure the delivery of a high-level QoS (see Fig. 7). First, the different types of IoT systems (physical, virtual, and hybrid), which often require integration with Cloud technology, were identified in Sect. 2.1.1. The integration between the IoT ecosystem and Cloud computing was then discussed in Sect. 2.2.

Thus, the Cloud can be embedded in IoT architectures (see Sect. 2.1.2) in different ways. According to Botta et al. (2016), it can be embedded as an intermediate layer between the things and the application layers to hide the extra coding and algorithmic complexity; in the network layer, by managing the exchanged data or filtering the data sent to the Cloud using fog nodes (see footnote 4); or in the application layer, where it appears as a pure business model.

Overall, the above presentation of the CloudIoT paradigm integrates the Cloud partially compared to the minimum integration that uses the classic-Cloud layers (IaaS, PaaS, and SaaS). Likewise, the full integration between Cloud and IoT defined a novel IoT vision for smart cities based on physical objects, named Cloud of things (CoT). Thus, the Cloud layers are redefined as a City Infrastructure as a Service (CIaaS), City Platforms as a Service (CPaaS), and City Software as a Service (CSaaS) (Cavalcante et al. 2016).

In fact, the Cloud offers a rich platform for implementing several strategies to control and manage IoT devices (things), to provide reusable and elastic services that open the door to online collaboration (Ullaha et al. 2018), and to guarantee a stable QoS (Anisetti et al. 2020). In addition, the Cloud can benefit from the highly distributed, open scope of things with context-aware capability (Botta et al. 2014).

Second, Hu et al. (2008) and Mohanan et al. (2017) have discussed the integration between the IoT and the AC. Additionally, they highlighted the motivation for this integration, which supports self and context management and provides more abilities, such as:

  1. A dynamic configuration of resources in the required condition.

  2. A high level of monitoring for any situation.

  3. A dynamic replacement, also named self-matchmaking, of the disconnected or failed component according to a recovery plan.

  4. The integration of the self-description property. Under this property, the sensor becomes more flexible, so it can be used on a large scale and presented in any context.

Fig. 7

The integration between the three domains

In the distributed IoT category, things are autonomic, highly distributed, and able to create novel services on request; hence, they are visible in the open range of discovery and can improve network performance by assisting data flow through communication (Qureshi et al. 2020). Autonomous things satisfy end-user requests and ensure system functionality by managing their tasks when they lose their basic functions (Angarita 2015). This allows autonomic elements (agents) to be present in the system and to operate as responsible units (responsible objects) for its elements, in order to deal with the distributed nature of IoT devices and the complexity of integrating IoT solutions.

Moreover, Ashraf and Habaebi (2015) motivate the integration of AC properties to manage any security conditions (dealing with the CIA; see footnote 5) and to switch into fail-safe mode using a mitigation approach. The integration of the three domains, Cloud, IoT, and AC, creates a new concept named Cloud-based autonomous things; furthermore, the scope of the integration can be extended to the Cloud of autonomous things as a basic platform for smart cities. In sum, under the Cloud-based autonomous things concept, we propose a protocol to ensure the availability of the provided services.

4 Related works

It is worth mentioning that most works in the literature use recovery as a fixed action after the attack has happened and do not deal with it in real time to ensure service availability once it is detected. In this section, we present the existing healing contributions that deal with the availability problem (under attack and failure) in two subsections. The first covers self-healing solutions based on the generic IoT architecture (used only for physical-IoT systems) to heal from IoT device misbehavior. The second targets the same problem using Cloud technology for hybrid-IoT systems.

4.1 Related works based on physical-IoT system

In Kühn et al. (2018), the reliability of the IoT system is maximized using a formal specification for distributed systems. The process triggers healing actions when an IoT node fails or when a link between components breaks. As a result, the system moves back to the correct state after detecting the failure. The repair can be carried out in several ways: reconfiguration/restoration of the device, relocation of the hardware/software components, or use of a surrogate component that must serve the same purpose as the failed one.

A self-healing model is used to increase IoT system reliability. This model uses a volatile node (powerful node) responsible for observing the system functionalities/behaviors, and it has three attributes: monitoring, diagnosis, and mitigation. First, the monitoring attribute covers the detection of a violating event (one that happens by coincidence, not a malicious event), the detection of violated requirements, and the analysis of the IoT devices involved.

Second, the diagnostic attribute uses the analysis result to locate defective components and to identify the cause of the problem in order to create the appropriate repair plan or strategies. Last, the mitigation attribute executes the selected plan to return the system to its correct state. Additionally, the information base (knowledge part) is shared by all three elements and can store the failure model for future use. However, this contribution relies on a hierarchically designed system with integrated, predefined monitoring components, and uses redundancy of the responsible component to ensure safety-critical operation.

In Montoya et al. (2018), the authors presented another form of attack on IoT nodes, the Denial of Sleep attack, in which the attackers continually send request messages to the victim node to prevent it from going back to sleep mode. This type of attack targets the device energy and exhausts its battery life (physical attack). The proposed lightweight algorithm, SWARD, decomposes the system into two sub-systems: on-demand and always-responsive. The on-demand sub-system is activated or deactivated according to requirements to save battery energy; the always-responsive sub-system, in turn, is reactivated after receiving the token message (reception event) in the device letterbox. The SWARD protocol generates a hard-to-guess wake-up token to mitigate the wake-up tokens sent by attackers to the device letterbox, in order to preserve the device energy under attack, ensure confidentiality and authenticity, and enhance the efficiency (performance) of the vulnerable device.

Another work, by Carl et al. (2006), only detects the existence of (D)DoS in the IoT device. Recall that detection is the first attribute of the self-healing process, followed by diagnostic and recovery. This work presents three types of (D)DoS detection methods: activity profiling, change-point detection, and wavelet-based signal analysis. Firstly, activity profiling clusters similar network packets into IoT islands to determine the packet flow and to avoid high-dimensionality issues; an increase of the flow at the cluster level indicates the existence of attacks.

Secondly, the change-point-detection method isolates and filters the traffic changed by attacks from the usual monitored traffic (statistical method). Thirdly, the wavelet-based signal analysis method searches for a weak point in the system or an open door for a vulnerability attack in order to identify it. Additionally, the authors presented a detection process for TCP SYN attacks (a type of DoS attack), in which the detection process repeatedly examines the TCP flags to check the consistency of the flag information in the transported packets.
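To make the change-point idea concrete, the sketch below uses a one-sided CUSUM statistic, a standard change-point technique chosen here only for illustration; the source does not state which statistic Carl et al. (2006) rely on. The function name, the parameters (`baseline`, `slack`, `threshold`), and the traffic values are all hypothetical.

```python
from typing import Optional, Sequence


def cusum_change_point(
    rates: Sequence[float], baseline: float, slack: float, threshold: float
) -> Optional[int]:
    """Return the first index where the one-sided CUSUM statistic exceeds threshold.

    rates:     observed packet rate per time window (illustrative values)
    baseline:  expected rate under normal traffic (assumed known)
    slack:     tolerated deviation before evidence starts to accumulate
    threshold: accumulated evidence required to raise an alarm
    """
    s = 0.0
    for i, x in enumerate(rates):
        # Accumulate only upward deviations; the statistic never goes below zero.
        s = max(0.0, s + (x - baseline - slack))
        if s > threshold:
            return i
    return None


traffic = [10, 11, 9, 10, 12, 30, 32, 31, 29]
alarm_at = cusum_change_point(traffic, baseline=10.0, slack=2.0, threshold=30.0)
print(alarm_at)  # index of the window in which the alarm is raised
```

With these values the statistic stays at zero during the steady windows and crosses the threshold in the second window of the surge, which shows why a sustained change is flagged while a single noisy sample is not.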

In WISeKey (2016), the authors proposed a protection solution for IoT devices against DDoS attacks, in which protection acts as a pre-processing step based on the analyzed attacks on the devices. In sum, the following points present the possible illegitimate activities, illustrated earlier in Sect. 2.3.2, together with their solutions:

  1. Door opening: The potential solution for this problem is either forcing periodic modification of the password or using mutual authentication with the private key.

  2. Get into the house: The basic solution for this type of DDoS attack is either updating by a third party or activating the system firmware.

  3. No bandwidth limitation (default limitation): The applied solution deals with DoS attacks in the authentication process using Public Key Infrastructure in order to limit access to the device data and maximize their integrity.

Another solution, proposed by Sharma et al. (2017), is based on a physical-IoT system using the generic IoT architecture in order to mitigate zero-day attacks and provide reliable communication, without autonomic behaviors. The proposed solution, the IoT-diagnosis system, is installed at several levels:

  1. In the central units (Centralized Diagnosis System, CDS), to share the available diagnostic results with others and maximize the reliability of the IoT system.

  2. In the sub-central units (Local Diagnosis System, LDS), to monitor and manage the operability of IoT devices and to store the distributed details about each device state in the sub-system. Hence, the result of the local diagnostic gives an overview of the whole system, in which the units are organized hierarchically.

  3. In the local unit (Semi-Diagnosis System, SDS), to collect data from devices and share them directly with other units using a sharing protocol. Besides, the system checks the authenticity of each device once a vulnerability appears in the network.

4.2 Related works based on hybrid-IoT system

A hybrid solution (firmware) named Cloudflare Orbit is presented in a technical report by Bajunaid (2015). It investigates cyber-attacks (DDoS) in IoT systems to maximize the availability of IoT devices. Cloudflare Orbit is used to secure the authentication between the device and its server by implementing a virtual path for reconnecting the IoT device, and to fix the attacked device itself by updating its system. However, the authors presented the healing process in two stages, detection and repair, using fixed actions.

The firmware detects suspected access by illegitimate users during authentication by comparing the certificate information of the devices with that of their producers; if the detection result is positive, Cloudflare blocks connectivity from the compromised units (the attack source). After the attack has happened, the user or device owner is responsible for downloading and installing the hot-fix or recovery path (updating the device using the cache).

In Yaseen et al. (2018), the hybrid system includes not only the IoT and Cloud technology but also fog computing; it has been developed as a multi-level solution to deal with collusion attacks in WSNs, in which the fog is placed as an intermediate between the IoT devices and the Cloud. Hence, the IoT system is divided into sub-systems; each sub-system is referred to as an island, and each island includes a cluster managed by a responsible fog node. This separation into islands helps to keep track of the devices and to search for potential attacks in real time.

Besides, the proposed solution supports dynamic changes of device location using a hand-to-hand mechanism between the fog nodes. Moreover, the cloud processes and analyzes the information received from each fog node to mitigate the collusion; detection is thus an active process applied in real time. Mitigation, in contrast, is a passive process in the cloud; it is used as a pre-processing of the detected information to maximize the reliability of the system and to extend the knowledge of existing attacks.

It is worth mentioning that resilience mechanisms are also used as healing strategies to protect and repair the IoT system in potentially malicious environments. For example, Kim et al. (2017) present a resilience mechanism, named secure migration, for IoT gateways deployed as a Cloud server or a local node (e.g., a router), in order to enhance the availability of IoT infrastructures. Additionally, they use the migration technique to reconnect all the authorized entities of the failed node (entity) to another local node.

Both nodes are deployed using edge computing, named Auth, and have a common entity (device) that helps the migration plan execute smoothly according to pre-defined policies. A failure of the edge node is detected by one of the entities registered with this node, based on a response delay exceeding a finite amount of time. Another approach, proposed by Cirani et al. (2018), shares the vision of the previous one but is based on network elements. This approach replicates the edge node, named IoT hub, under a synchronization process; thus, the node exists implicitly in the cloud to manage the IoT resources in a scalable and secure way.

5 The proposed protocol

In this section, the CIoTAS proposed protocol will be introduced to solve the availability problem under the things collaboration. First, we give a motivation for our proposed approach. Second, we present novel states of the system under two control loops (self-healing and self-protection). Third, we present the abnormal situations in the collaboration phase and the effect of them on IoT devices under this phase in the threat model and analysis section. Fourth, we proffer system parameters, data structures, and the formulation of the collaborative available services problem. Fifth, we illustrate the proposed approach, its general functionalities, its fundamental events, and procedures. Then, the proposed protocol will be proved based on the introduced properties and a defined theorem, and an example will be presented to clarify the application of the protocol procedures.

5.1 Motivation

Autonomy is a powerful search technique that has been successfully used to solve management problems of system misbehavior in a single device/node. Agent-based systems, previously considered in AC, provide service availability by redundancy in an open environment using the self-healing and self-protection loops. The agent in the IoT system (Distributed IoT) has self-awareness of its environment, is responsible for managing unpredictable events to achieve a pre-defined system objective, and is able to collaborate with its neighbors to heal from poor service delivery.

The proposed solution answers the following question: “How to heal the device state from miss-collaboration with its neighbors after providing their services to it (unavailable action) in an IoT-dynamic environment?”

5.2 System states under control loops

5.2.1 States of system under collaboration

The traditional definition of the self-healing-control loop contains three system states: normal, degraded, and broken states (Ghosh et al. 2007). According to this definition, the system changes its state depending on the functionality of its elements and based on individual execution of local-required services (Psaier and Dustdar 2011). Yet, the system states are defined without considering the collaboration situation with its neighbors to fulfill the required tasks. However, if the collaboration situation is introduced, the device state will depend on the existence of the received results from its neighbors (devices) for the requested tasks to continue operating in its normal state.

In Distributed IoT, collaboration is needed. In such a case, the device interacts with its neighbors (contextual devices), invokes the devices that may execute the required tasks/services, and waits for the result. If the system receives the results for the pre-requested service, then it returns to its normal state and continues operating until the next collaboration. Hence, the self-healing loop controls the system states under both phases: local and distributed task processing.

Alternatively, an unhealthy interaction with other devices (things) in the collaborated state causes a degraded state of the system services. This degradation can lead to a broken state if the diagnosis element cannot elaborate a recovery plan or the system does not execute the chosen plan correctly. Likewise, the system may fall into a broken state if a DDoS (availability attack) is detected as suspected behavior in the service provider (device). Figure 8 illustrates the system states under a collaborated self-healing loop.

Fig. 8

System states under a collaborated-self-healing loop
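The four states described in this subsection can be sketched as a small transition table. This is one reading of the text, not a transcription of Fig. 8: the event names are hypothetical labels, and attaching the DDoS-detected event to the degraded state is an assumption, since the text does not say from which state that transition starts.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    COLLABORATED = "collaborated"
    DEGRADED = "degraded"
    BROKEN = "broken"


# (current state, event) -> next state; event names are hypothetical labels.
TRANSITIONS = {
    (State.NORMAL, "request_sent"): State.COLLABORATED,
    (State.COLLABORATED, "result_received"): State.NORMAL,
    (State.COLLABORATED, "unhealthy_interaction"): State.DEGRADED,
    (State.DEGRADED, "recovery_succeeded"): State.NORMAL,
    (State.DEGRADED, "no_recovery_plan"): State.BROKEN,
    (State.DEGRADED, "recovery_failed"): State.BROKEN,
    (State.DEGRADED, "ddos_detected"): State.BROKEN,  # assumed source state
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no listed transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, start: State = State.NORMAL) -> State:
    """Apply a sequence of events starting from `start` and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


print(run(["request_sent", "result_received"]))
print(run(["request_sent", "unhealthy_interaction", "no_recovery_plan"]))
```

A table keyed by (state, event) keeps the broken state absorbing by construction, because no transition is listed out of it.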

5.2.2 States of system under protection

The defined attributes of the self-protection loop are anticipation, detection, protection, and mitigation. Note that either the self-healing or the self-protection loop can launch the detection attribute; cooperation between the control loops minimizes the AC process and maximizes the QoS parameters. However, self-protection implicitly needs self-healing to be present in order to maintain a healthy state.

Fig. 9

System states under the self-protection loop

The defined self-protection loop (see Fig. 9) aims to discover an attack in the system and to protect the system autonomously. For the discovery process, the system recognizes and analyzes the existence of an attack in its elements by monitoring its events using a single/composite component responsible for scanning the collected data in search of anomalies. Based on the discovery stage, the system is in a healthy state (normal state) if the component (scanner) does not detect malicious code in the system files; otherwise, the system is unhealthy.

Detection can be done using either signature detection or behavior detection (proactive detection) (Kolias et al. 2017; Hallman et al. 2017). Signature detection inspects the file system at the byte-code level to detect abnormal code using comparison methods. Behavior detection, in contrast, monitors the system events, looking for suspicious activity or the presence of a malicious element.
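As a rough illustration of comparison-based signature detection, the sketch below hashes file contents and compares the digests with a set of known-bad signatures. The use of SHA-256, the function names, and the file names and contents are assumptions made for the example; the cited works may use different signature formats.

```python
import hashlib
from typing import Dict, List, Set


def sha256_hex(data: bytes) -> str:
    """Hex digest used here as the file signature (an assumed choice of hash)."""
    return hashlib.sha256(data).hexdigest()


def scan_by_signature(files: Dict[str, bytes], known_bad: Set[str]) -> List[str]:
    """Return the names of files whose digest matches a known-bad signature."""
    return sorted(name for name, data in files.items() if sha256_hex(data) in known_bad)


# Hypothetical signature database and file system snapshot.
known_bad = {sha256_hex(b"malicious payload")}
files = {"boot.bin": b"trusted firmware", "update.bin": b"malicious payload"}

flagged = scan_by_signature(files, known_bad)
state = "unhealthy" if flagged else "healthy"
print(flagged, state)
```

This only catches code that is already in the signature set, which is why the text pairs it with behavior detection for previously unseen threats.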

In a positive detection case, the self-protection loop protects the system using either Sandbox (emulator) mechanisms or virtualization mechanisms to control the resource manipulated by the malicious element. In the first mechanism, the emulator fakes the environment to test the results of the manipulation and keeps the program running in framed execution (isolation and execution). In the second mechanism, virtualization, the system offers the actual execution of system resources by the malicious code according to pre-defined rules.

The system at this stage is in a vulnerable state; it uses both the diagnostic and recovery elements of self-healing to return to its normal state. The diagnostic identifies the cause of the problem, which can be a remaining fault or an external attack (malicious code/virus), declares the system state (healthiness, sub-healthiness, or sickness) (see Fig. 6), and makes a planning decision to elaborate a recovery strategy.

The recovery process is performed by prevention methods (local replication of resources) or by backward techniques, such as compensation of damaged files, data, or services (Angarita 2015). If the protection strategies do not work, then the system state is violated, especially when the system uses virtualization methods and damage detection is delayed. A possible way to recover and return to a healthy state is to apply mitigation strategies, such as rebooting or relocating the services of the affected component.

5.3 Threat model

In the proposed protocol, we primarily focus on failures and on the availability threats of DDoS attacks that exploit vulnerabilities; we assume that the attacker threatens the availability of the service provider using DDoS attacks. The attacker may interrupt the device communication, target its equipment (exhausting its power), or inject malicious code/malware, which may produce other malicious events (changing its signature). Hence, these illegitimate activities manifest as unhealthy interactions between the requester and the provider of a requested service.

These activities affect both the service provider and the requester. On the provider side, the provided services are denied and unavailable for delivery, since the provider is unresponsive to requests. The requester, on the other hand, suffers from a degraded state that shows as starvation while it waits for the results of essential requested services. This appears either as the absence of results after the timeout or as null results. In other words, the service exists in the provider but is not available for collaborative delivery, which leads to unacceptable QoS parameter values.
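The two symptoms seen by the requester (no result after the timeout, or a null result) can be sketched with a simple inbox check. This is an illustrative sketch only: modeling the reply channel as a `queue.Queue`, the function name `classify_interaction`, the returned labels, and expressing the timeout in seconds are all assumptions.

```python
import queue


def classify_interaction(inbox: queue.Queue, delta: float) -> str:
    """Classify one request/response exchange from the requester's side.

    inbox: channel on which the provider's reply is expected (assumed model)
    delta: timeout in seconds (the unit is an assumption)
    """
    try:
        result = inbox.get(timeout=delta)
    except queue.Empty:
        # No reply arrived before the timeout expired.
        return "timeout"
    # A reply arrived, but a null value is still an unhealthy interaction.
    return "null_result" if result is None else "healthy"


inbox = queue.Queue()
print(classify_interaction(inbox, delta=0.01))   # nothing was sent

inbox.put(None)
print(classify_interaction(inbox, delta=0.01))   # provider answered with null

inbox.put("23.5")
print(classify_interaction(inbox, delta=0.01))   # usable result
```

Both unhealthy outcomes would move the requester toward the degraded state, while the distinction between them can still be useful to the diagnostic attribute.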

5.4 Threat analysis

The proposed protocol uses the selected defense mechanism, attack reaction (see Sect. 2.3.3), in which unhealthy interactions are detected using contextual anomaly-based detection (see footnote 6) (Kaur et al. 2017). This detection helps to identify the anomaly in the service provider at the time of the request.

The service requester recovers from the denied service consequences using a defined plan by the diagnostic attribute (the available option). The defined plan could be a selection of a substitute element (provider) or using a replica of this provider (see Figs. 6 and 10) in order to return to its normal state after the achievement of collaborating tasks (for more details see Sect. 5.10).

In a collaborative context, once the collaborative self-healing loop has detected a suspected DDoS attack in the service provider, protection can be performed locally, using the attributes of the self-protection loop, or remotely, through mitigation by a responsible node that works as the emulator for the violated device. This node chooses the appropriate reaction according to the type of DDoS attack detected, such as updating the device files or rebooting the device remotely.

5.5 System-parameters definitions

An IoT system is an open system that comprises a number of agents, denoted \(Ag_i\), \(Ag_j\),..., which manage the IoT devices (one agent per device). The system agents communicate only by exchanging messages over the network. In the proposed protocol, a limited number of control messages are generated for the diagnostic and recovery processes, when necessary. Message passing is asynchronous; each device has its own speed and location and sends messages through the communication channels in a finite amount of time (delays), which can invoke the recovery process.

The protocol is divided into two parts. The first part is implemented in each physical thing (physical device); it defines its associated agent behavior for managing its collaboration. The second part is implemented in the \(S^2aaS\) model in order to manage the shadow device by the shadow agent. The following is the definition of the protocol entities:

Definition 1

(Physical agent) An agent is a tuple \(<Ag, SDA, State, SCA, LastCx, LibAc>\) where:

  • Ag is the agent identifier.

  • SDA is a set of available services (actions).

  • State is the agent state variable. It takes its value in \(\{ normal\), \(collaborated, \, degraded, \, broken\}\).

  • SCA is a set of contextual agents (discovered devices).

  • LastCx is a variable that stores the last communication with the shadow.

  • LibAc is the library of actions used for managing its data and its state according to its shadow in the \(S^{2}aaS\) model.

Definition 2

(Cloud protocol) The diagnostic attributes form a tuple \(<CAg, CSDA, CSCA, CState, CDep, LibReact>\) where:

  • CAg is the shadow agent identifier.

  • CSDA is the set of the shadow available services (actions).

  • CState is the shadow physical agent state variable. It takes its value in \(\{\)normal, collaborated, degraded, broken \(\}\).

  • CDep is a set of contextual-discovered devices of the corresponding physical agent. In the case when the physical agent fails, CDep set is used to inform the agents about the failure and as basic information in the future collaboration process.

  • LibReact is the set of alternative actions that should be considered in the case of the requested action failed or denied.

Note that \(CAg_i\) (the shadow), the copy of \(Ag_i\) in the Cloud platform, offers the same services as \(Ag_i\). It is used by the protocol to manage unpredictable situations.
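Definitions 1 and 2 can be mirrored by two record types. The sketch below is only one possible encoding: the field types, the representation of `LastCx` as a timestamp, the shape of `LibReact` as a mapping from an action to its alternatives, the treatment of `CSCA` (which the definition lists but does not describe) as a copy of `SCA`, and the `"C"` prefix for shadow identifiers are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

STATES = {"normal", "collaborated", "degraded", "broken"}


@dataclass
class PhysicalAgent:
    """Definition 1: <Ag, SDA, State, SCA, LastCx, LibAc>."""
    ag: str
    sda: Set[str] = field(default_factory=set)       # available services (actions)
    state: str = "normal"
    sca: Set[str] = field(default_factory=set)       # contextual (discovered) agents
    last_cx: Optional[float] = None                  # last contact with the shadow (assumed timestamp)
    lib_ac: Set[str] = field(default_factory=set)    # management actions


@dataclass
class ShadowAgent:
    """Definition 2: <CAg, CSDA, CSCA, CState, CDep, LibReact>."""
    cag: str
    csda: Set[str] = field(default_factory=set)
    csca: Set[str] = field(default_factory=set)      # assumed to mirror SCA
    cstate: str = "normal"
    cdep: Set[str] = field(default_factory=set)      # devices to inform on failure
    lib_react: Dict[str, List[str]] = field(default_factory=dict)  # action -> alternatives (assumed shape)


def make_shadow(agent: PhysicalAgent) -> ShadowAgent:
    """Build a shadow offering the same services as its physical agent."""
    return ShadowAgent(
        cag="C" + agent.ag,                          # hypothetical naming convention
        csda=set(agent.sda),
        csca=set(agent.sca),
        cstate=agent.state,
        cdep=set(agent.sca),
    )


agent = PhysicalAgent(ag="Ag1", sda={"read_temperature"}, sca={"Ag2", "Ag3"})
shadow = make_shadow(agent)
print(shadow)
```

Copying the sets rather than sharing them keeps the shadow usable after the physical agent has failed or been tampered with.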

5.6 Paradigm-defaults properties

In the IoT system, things may not be stable geographically, and they need to interact continuously in real-time mode, react dynamically to external events, and maintain their state individually or remotely. Moreover, the CloudIoT paradigm assumes that IoT devices send numeric information about their states, in a secure and available manner, for managing concurrent tasks and updating their Cloud platform. This may be done by sending the appropriate information periodically, or according to dynamic environment changes. Practically the possible assumptions that should be considered for this paradigm are:

Assumption 1

The device sends periodically the context information to the Cloud.

Assumption 2

The Cloud updates the device state regularly.

Assumption 3

The IoT devices may continuously be moving.

5.7 Problem formalization

In the Distributed IoT approach, the collaboration phase requires the contextual discovery mechanism. For that, the proposed algorithm in Baitiche et al. (2018) will be used to realize this task. On the basis of this solution, we define two healing solutions: the contextual healing solution and the \(S^2aaS\) model as a backup model in order to ensure the self-healing property in IoT devices during the collaboration situation. Both solutions integrate the protection attributes of the self-protection loop and assume the following hypotheses:

Hypothesis 1

All services requested by agents are essential.

Hypothesis 2

Cloud services are reliable and available all the time.

Hypothesis 3

The set of available services SDA of the agent Ag is available in its shadow CAg.

Hypothesis 4

The attacker may change the values of the device signature.

Hypothesis 5

Finite time for message transmission.

Hypothesis 6

Connection is reliable.

5.8 Data Structures and Initialization

Each agent \(Ag_i\) maintains the following data structures as local variables:

  • SDA: A set of available actions that the agent (\(Ag_i\)) may execute; its initial value is \(\emptyset \).

  • SCA: A set of contextual agents; its initial value is \(\emptyset \).

  • a.SCA: Given an action a, a.SCA is the set of contextual agents that may execute action a.

  • SFA: A set of failed agents initialized by \(\emptyset \).

  • State, lastState: Variables representing the current and the last state of an agent, the states of an agent can be {normal, collaborated, degraded, broken}. The initial value of a state is normal.

  • R: A String variable, it contains the result of the request of the agent that executes an action a; its initial value is NULL.

  • a.State: A variable that takes its values in {fail, succ}; its initial value is succ.

  • Signature, LastSignature: String variables initialized to NULL.

  • \(\Delta \): The response timeout, defined with \(\Delta \ge 1 \).

  • \(Diagnostic-State\): A Boolean variable initialized by true.

In the following, we define the functions used in the proposed protocol:

  • \(get : SDA \rightarrow SCA\): a function that returns an agent from SCA. In the proposed protocol, this function is called to choose an agent that may execute a specific action.

  • \(getF : SDA \rightarrow SFA\): a function that returns a failed agent from SFA.

  • \(getS : SFA \rightarrow S\): a function that returns the physical-agent signature.

  • \(Select : SDA \rightarrow S^{2}aaS\): a function that returns an agent from the \(S^{2}aaS\). This function is called for choosing a shadow agent that may execute a specific action.

  • \(\wedge \) : Function that calculates the system state according to Table 1.
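The selection functions \(get\) and \(Select\) can be sketched as lookups over simple mappings. This is an illustrative sketch, not the protocol's procedures: representing `a.SCA` as a dictionary from an action to its candidate agents, skipping agents already in `SFA`, and choosing the alphabetically first candidate are assumptions introduced here. The \(\wedge \) function is left out because Table 1 is needed to define it.

```python
from typing import Dict, Optional, Set


def get(action: str, a_sca: Dict[str, Set[str]], sfa: Set[str]) -> Optional[str]:
    """Pick a contextual agent able to run `action`, skipping known failed agents.

    a_sca: action -> contextual agents that may execute it (stands for a.SCA)
    sfa:   set of failed agents (SFA)
    """
    candidates = sorted(a_sca.get(action, set()) - sfa)
    return candidates[0] if candidates else None


def select(action: str, s2aas: Dict[str, Set[str]]) -> Optional[str]:
    """Pick a shadow agent whose services include `action`.

    s2aas: shadow identifier -> services it offers (assumed registry shape)
    """
    for shadow_id in sorted(s2aas):
        if action in s2aas[shadow_id]:
            return shadow_id
    return None


a_sca = {"read_temperature": {"Ag2", "Ag3"}}
sfa = {"Ag2"}
s2aas = {"CAg2": {"read_temperature"}, "CAg4": {"read_humidity"}}

print(get("read_temperature", a_sca, sfa))      # contextual substitute
print(get("read_humidity", a_sca, sfa))         # no contextual agent
print(select("read_humidity", s2aas))           # fall back to a shadow
```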

5.9 Required properties

To maximize the system-service availability, Psaier and Dustdar (2011) mention that Arora and Gouda (1993) defined two general properties, named convergence and closure, as follows:

  1. Convergence means: “The system is guaranteed to return to the legal state in a finite amount of time regardless of interferences”; it satisfies the fact that something good eventually happens.

  2. Closure means: “Once in the legal state, it tries to remain in the same state again”; it satisfies the fact that nothing bad happens.

For the proposed protocol, these properties are redefined as follows:

  1. The protocol Convergence property means: “If recovery is possible, the protocol guarantees it in a finite amount of time”.

  2. The protocol Closure property means: “If the protocol detects the existence of a DDoS attack in the system, then the attack is real and the protocol recovers from it”.

5.10 The general functionalities of the proposed protocol

Figure 10 summarizes the protocol functionalities and defines the elements of its self-healing attributes, passing through service-discovery management to achieve the availability of the contextual services.

According to the protocol convergence definition, the protocol guarantees the property in two steps:

  • Step 1 To recover, in a finite amount of time, from poor collaboration in services that causes a degraded system state, the protocol applies the contextual-recovery plan to the pre-discovered services.

  • Step 2 If the recovery plan fails to recover from the mismatching contextual services (or unavailable services), the protocol uses the cloud-replication strategy (shadow agent) as an alternative solution. The shadow continuously saves the state, data, and context of its physical device. Hence, it acts as a virtual representation (VO) of the physical device and is used to maintain the device functionality in the \(S^2aaS\) model when the device loses its abilities (failed or attacked).

To satisfy the convergence property, the proposed protocol manages the following abnormal situations that the agent can fall into:

  1. If action a belongs to the set SDA of an agent \(Ag_i\) (\(a \in abilities (Ag_i)\)) and \(Ag_i\) becomes unavailable, then the contextual solution is used: a substitute contextual agent of \(Ag_i\) is selected and delegated the execution of the same requested service (action a).

  2. If no contextual agent is available to execute a requested service (action a), then the alternative solution is used: a shadow agent that can execute action a is called.

  3. If the agent \(Ag_i\) becomes unavailable and \(a \in abilities (Ag_i)\), then any contextual agent that requests \(Ag_i\) to execute action a informs its own shadow, so that the state of \(Ag_i\) can be diagnosed and the required healing procedure taken.

The protocol ensures the closure property in the following possible cases:

  1. If, at time t, an agent misses a result for action \(a \in SDA\), and the protocol detects the existence of a DDoS attack in the requested agent (the provider) that offers action a, then the DDoS attack is real.

  2. If, at time t, an agent in a normal state receives a result message from a vulnerable agent that provides action a after a specific amount of time, then the protocol ignores the message and classifies it as malicious.

Fig. 10 The self-healing attributes under the proposed protocol

5.11 System events

In the following sections, we will enumerate the possible events that could happen for both contextual and Cloud agents. These events appear during the context discovery and the collaboration between its agents. Each event is associated with either a procedure of the contextual sub-protocol executed by the agent in the physical device or a procedure of the Cloud sub-protocol executed by the Cloud agent in the \(S^2aaS\) model (see Fig. 11).

5.11.1 Discovery-protocol events

For the discovery part, the agent acts and reacts according to the following events:

  • Event 1 (\(e_1\)): When the agent invokes the Discover routine.

  • Event 2 (\(e_2\)): When the agent receives the message Discover from another agent.

  • Event 3 (\(e_3\)): When the agent receives the Response message from another agent.

5.11.2 Healing-protocol events

The abnormal situations that could happen in the collaboration process of the device in its context are:

  • An agent may gain access to some services of other agents and lose access to others.

  • An agent loses access to another agent if the latter is in the busy state (e.g., updating stage), either after it has accepted the execution of a requested service or when it denies the request before accepting the execution.

  • An agent loses access to another agent if it is denied (attacked).

  • An agent loses access to another agent if it is failed.

The events of the service-healing protocol according to the required agent are as follows:

  • Physical-agent events

    • Event 4 (\(e_4\)): When the agent sends a Request message to another agent in the same context.

    • Event 5 (\(e_5\)): When the agent receives a Request message from another agent in the same context.

    • Event 6 (\(e_6\)): When the agent receives a Result message from another agent.

    • Event 7 (\(e_7\)): When the agent invokes the Choose-Agent-for-the-action routine.

    • Event 8 (\(e_8\)): When the agent receives a Result message from a shadow agent.

    • Event 9 (\(e_9\)): When the agent receives a Diagnostic-Result message from its shadow.

    • Event 10 (\(e_{10}\)): When the agent receives a Get-State message from its own shadow.

  • Agent-shadow events

    • Event 11 (\(e_{11}\)): When the shadow agent receives a Get-Diagnostic message from its physical agent.

    • Event 12 (\(e_{12}\)): When the shadow agent receives a Get-Shadow-State message from another shadow agent.

    • Event 13 (\(e_{13}\)): When the shadow agent receives a Shadow-Diagnostic message from another shadow agent.

    • Event 14 (\(e_{14}\)): When the shadow agent receives a Signature message from its physical agent.

Fig. 11 The agents relating events graph

The proposed protocol is defined by the event-associated procedures given in the following section. To update the system state, we introduce the \(\wedge \) function, which takes the last and the current system states as arguments and returns the resulting system state (see Table 1).

Table 1 System state updating using the \(\wedge \) function

5.12 Protocol procedures

The protocol procedures are called after the initialization step, as presented in Sect. 5.8. We mentioned that each procedure call is associated with an event occurrence. For the sake of presentation, we divide the procedures into two sets: the physical-agent procedures and the shadow-agent procedures. Since the procedures can be invoked concurrently, their execution should be atomic so that the shared variables are modified under mutual exclusion; only the waiting instruction is interruptible.
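
The atomicity requirement can be illustrated with a minimal Python sketch. It is not the JADE implementation; the class `AgentState` and its method names are hypothetical, and the sketch only assumes that the shared variables are guarded by one lock and that the waiting instruction releases that lock while blocked.

```python
import threading


class AgentState:
    """Shared variables of one agent, guarded by a single condition variable.

    Hypothetical illustration: names are not taken from the protocol listing.
    """

    def __init__(self):
        self._cond = threading.Condition()  # a lock plus wait/notify
        self.sda = set()                    # actions still awaiting a result
        self.results = {}                   # action -> received result

    def request(self, action):
        # The whole procedure body runs while holding the lock (atomic).
        with self._cond:
            self.sda.add(action)

    def on_result(self, action, result):
        # Also atomic; results for actions no longer in SDA are ignored.
        with self._cond:
            if action in self.sda:
                self.sda.discard(action)
                self.results[action] = result
                self._cond.notify_all()

    def wait_for(self, action, timeout):
        # wait_for() releases the lock while blocked, so other procedures
        # may run in the meantime: the only interruptible point.
        with self._cond:
            arrived = self._cond.wait_for(
                lambda: action in self.results, timeout
            )
            return self.results.get(action) if arrived else None
```

In this sketch, a `None` return from `wait_for` plays the role of a timeout \(\Delta \) expiring without a result.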

5.12.1 Physical-agent procedures

The general structure of an agent is given by the following grammar, written in BNF (Backus-Naur Form):

figure a

As an example, the description code of an agent may be:

figure b

Thus, an agent may request a distributed execution of a set of actions. Before requesting another set of actions, it should receive the results for the previous set. In the sequel, the procedures are defined for an agent \(Ag_i\).

5.12.2 Service-discovery procedures

  • For event \(e_1\): The agent \(Ag_i\) starts exploring its environment by broadcasting a discover message to detect all the available actions provided by the agents in its context.

figure c
  • For event \(e_2\): The agent \(Ag_i\) responds to a received Discover message by giving its identity and its list of provided services (set of available actions SDA).

figure d
  • For event \(e_3\): After discovering a context, the agent updates its set of contextual agents (SCA). Consequently, for each available action a, the function Set(a) gives the set of agents that may execute the action a.

figure e
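
The three discovery procedures can be sketched in Python as follows. This is an illustrative sketch, not the authors' listing: the class `Agent`, the attribute `providers` (standing for the function Set(a)), and the modeling of a broadcast as a loop over reachable agents are all assumptions.

```python
from collections import defaultdict


class Agent:
    """Hypothetical agent holding the discovery-related sets."""

    def __init__(self, name, provided_actions):
        self.name = name
        self.provided = set(provided_actions)  # actions this agent offers
        self.sca = []                          # discovered contextual agents
        self.providers = defaultdict(list)     # Set(a): action -> agents

    def on_discover(self, sender):
        # e2: answer a Discover message with identity and provided actions.
        return (self.name, sorted(self.provided))

    def on_response(self, agent_name, actions):
        # e3: update SCA and Set(a) from a Response message.
        if agent_name not in self.sca:
            self.sca.append(agent_name)
        for action in actions:
            if agent_name not in self.providers[action]:
                self.providers[action].append(agent_name)

    def discover(self, context):
        # e1: broadcast, modeled here as a loop over the reachable agents.
        for other in context:
            if other is not self:
                self.on_response(*other.on_discover(self.name))
```

With agents \(Ag_j\) and \(Ag_k\) providing \(a_1\) and \(Ag_l\) providing \(a_2\), calling `discover` on \(Ag_i\) yields an SCA containing the three agents and `providers["a1"]` listing \(Ag_j\) and \(Ag_k\).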

5.12.3 Service-healing procedures

  • For event \(e_4\): Over time, the agent needs to collaborate with other agents to achieve its purpose. To do so, the agent updates its state to “collaborated” (line 1), saves this state in the LastState variable (line 2), and adds the requested action a to the SDA variable (line 3). To send a request for action a, the agent must invoke an agent that may execute the action.

figure f
  • For event \(e_5\): If an agent receives a request for executing the provided action \(a \in SDA\) from another agent, it executes the action and sends the results.

figure g
  • For event \(e_6\) : After sending the requests to other agents, the agent waits to obtain the results (R). If the agent receives the results (R) of a requested action (a), and the result is not NULL, it removes the action a from its SDA set and updates the a.State variable to succ. Otherwise (the results are NULL), the agent updates the a.State variable to fail. The last case happens when the provider agent denies the request due to its busy state (processing other tasks); this case is defined as an abnormal case for the requester.

figure h
  • For event \(e_7\): To send a request for executing a specific action a, the agent \(Ag_i\) must choose a provider agent from its SCA set (line 1). If such an agent exists (\( Ag_j \ne NULL\)), \(Ag_i\) sends it a request and waits for the result (lines 3–4). Hence, action a remains in the SDA set until its result is received. When the requested agent does not send the result in time, the state of \(Ag_i\) is updated to “Degraded”, and the requested agent is added to the SFA set (the set of failed agents) (lines 7–8). In such a case, the agent selects another agent from the SCA set (line 9). If no agent can execute action a (\(get(a) = NULL\)), \(Ag_i\) invokes the diagnosis elements by sending a message to its shadow (lines 11–21). The agent parameter of this message is either NULL or an agent selected from the SFA list (lines 17–18).

figure i
  • For event \(e_8\): When the agent receives the required result for action a from the shadow of a pre-selected agent, it removes the action a from SDA set, updates the action state to succ and its state according to the \(\wedge \) function (lines 2–3).

figure j
  • For event \(e_9\): After sending the diagnostic message to its shadow, the agent receives the diagnostic result and updates both sets: SCA using the updateA function and SFA using the updateF function. The update either removes the diagnosed agent from SFA and adds it to SCA if its state is busy (a normal state), or does the opposite if its state is “failed” or “denied”.

figure k
  • For event \(e_{10}\): When the agent receives the message Get-State from its shadow, it responds by sending its signature (line 1). This event supports the verification of the absence or existence of possible threats in its physical agent.

figure l
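
The selection logic of event \(e_7\) can be sketched in Python as follows. This is a hedged illustration rather than the authors' procedure: the callables `send_request` and `ask_shadow` are hypothetical stand-ins for the message exchange, a `None` result models a timeout \(\Delta \) or a NULL result, and removing a failed provider from SCA is an assumption made so that the loop terminates.

```python
def choose_agent_for_action(action, sca, sfa, send_request, ask_shadow):
    """Return (result, source, state) for `action`.

    sca: list of candidate contextual agents (mutated in place)
    sfa: list of suspected failed agents (mutated in place)
    send_request(agent, action): hypothetical transport; returns the result,
        or None when nothing usable arrives within the timeout
    ask_shadow(action, suspect): hypothetical call to the agent's own shadow
    """
    state = "collaborated"
    while sca:
        provider = sca[0]                  # get(a): pick a contextual agent
        result = send_request(provider, action)
        if result is not None:
            return result, provider, state
        # No result in time: degrade and move the provider from SCA to SFA.
        state = "degraded"
        sca.remove(provider)
        sfa.append(provider)
    # No contextual agent left: fall back to the Cloud (shadow) solution.
    suspect = sfa[0] if sfa else None      # getF: a failed agent, or NULL
    return ask_shadow(action, suspect), "shadow", state
```

The contextual solution is tried first and the shadow is contacted only once the candidate list is exhausted, which mirrors the two recovery steps described in Sect. 5.10.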

5.12.4 Shadow-agent procedures

The following procedures concern each shadow agent \(CAg_i\).

  • For event \(e_{11}\): When the shadow agent receives the diagnostic message from its physical agent, two cases may be distinguished. In the first case, the failed agent is unknown; therefore, the shadow agent selects, from the available ones in the PaaS layer, a shadow agent that may execute the requested service (action), using its \(S^2aaS\) model. In the second case, it sends a message to the failed agent's shadow, allowing it to accomplish the requested service and to diagnose its physical agent.

figure m
  • For event \(e_{12}\): When the shadow agent receives the diagnostic message from another shadow agent (CAg), it assumes that abnormal behavior has been detected in its physical agent. Therefore, a set of actions should be taken. First, it accomplishes the requested service and sends the results to the physical agent (Ag) (line 2). Second, it diagnoses its vulnerable physical agent by sending a signature-request message (line 3). Since the response time is bounded, a delayed response to the diagnostic message proves the existence of abnormal behavior in its physical agent. Consequently, the diagnostic is positive, and the agent \(Ag_i\) is in a failed state. In such a case, the diagnostic result is sent to the requesting agent's shadow (CAg).

figure n
  • For event \(e_{13}\): This event may happen when the shadow agent (\(CAg_i\)) is executing the previous procedure and is at the waiting instruction. After receiving the diagnostic result from its physical agent, it compares the received signature with the one saved in its database (lines 1–2). It then informs the requesting shadow agent about the result of the comparison. In the case of a positive diagnostic, the shadow agent may react to the detected DDoS attack based on its reaction library (LibRAct). The vulnerability level of its physical agent is determined according to the shadow agent's analysis (see Sect. 5.2.2).

figure o
  • For event \(e_{14}\): The shadow agent uses the received diagnostic results to protect the agent and to update its SCA and SFA (lines 1–9). This update is useful in future request fragments.

figure p
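
The signature-based diagnostic and the resulting update of SCA and SFA can be sketched in Python as follows. The mapping of outcomes to states is an assumption made for illustration (no signature within the bounded time gives "failed", a mismatch gives "denied", a match gives "busy"); the function names `diagnose` and `update_sets` are hypothetical, and `hmac.compare_digest` is used only as a convenient constant-time comparison.

```python
import hmac


def diagnose(saved_signature, received_signature):
    """Classify a physical agent from its signature reply (assumed mapping).

    received_signature is None when no reply arrived within the timeout.
    """
    if received_signature is None:
        return "failed"
    if hmac.compare_digest(saved_signature, received_signature):
        return "busy"      # treated as a normal state
    return "denied"        # signature mismatch: suspected attack


def update_sets(agent, state, sca, sfa):
    """Move `agent` between SCA and SFA according to the diagnosed state."""
    if state == "busy":
        if agent in sfa:
            sfa.remove(agent)
        if agent not in sca:
            sca.append(agent)
    else:  # "failed" or "denied"
        if agent in sca:
            sca.remove(agent)
        if agent not in sfa:
            sfa.append(agent)
```

For example, an agent listed in SFA whose signature matches the saved one is moved back to SCA, whereas a mismatch keeps it in SFA.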

5.13 Protocol assessment

To prove the correctness of the CIoTAS protocol, we start by proving the convergence and the closure properties:

Proposition 1

CIoTAS protocol verifies the convergence property.

Proof

Let’s recall the convergence property: ”If recovery is possible, the protocol guarantees it in a finite amount of time”.

We consider the following cases:

Case 01:

All the contextual agents that may execute action a are busy, failed, or denied. Suppose that the agent, at instant \(t_1\), needs to collaborate with other agents for action a. In accordance with procedure 4, it changes its state to collaborated, adds action a to its SDA, and sends a request to one of the provider agents in its SCA. Under this case and according to procedure 7, the agent's SDA still contains action a, and the agent keeps sending request messages, because no result arrives for this action, until its SCA becomes empty (NULL), assuming that all the discovered agents may execute the requested action (the time to response is evaluated in detail in Sect. 6.2).

As a consequence of this situation, and by hypotheses 5 and 6, no sent message is lost and each one is received in a finite amount of time; the agent therefore sends a diagnostic message to its shadow to obtain the result for the requested action. Hence, the agent receives at instant \(t_2\) the result for action a from a shadow agent in the Cloud services (hypothesis 2).

Case 02:

The \(i^{th}\) contextual agent that may execute action a is available. Suppose that the agent, at instant \(t_3\), starts the collaboration process with other agents for action a. According to procedure 7, it keeps sending request messages to the agents in its SCA until it receives a result for this action. Suppose that the first group of agents in its SCA are failed or denied and that the \(i^{th}\) agent, in the second group, is available to give a result for action a; in such a case, the agent receives at instant \(t_i\) the awaited result (not NULL). Consequently, the agent removes action a from its SDA and updates the action state (procedure 6).

Case 03:

The DDoS is real but not detected. Suppose that the denied device is not solicited by other agents for any requested action; in other words, the result of the discovery part does not include the denied agent in the SCA of any requesting agent. Thus, the healing protocol is not launched at any stage; the system services are available, and the system continues operating in its healthy state.

Proposition 2

CIoTAS protocol verifies the Closure property.

Proof

Let us recall the closure property: “If the protocol detects the existence of a DDoS attack in the system, then it is real and the protocol recovers from it”.

We consider the following cases to demonstrate the previous property:

Case 1:

The DDoS is real and detected.

The agent can, at any instant, request an agent from its SCA that may execute action a and obtain the result in a finite amount of time; indeed, the connection is reliable (hypothesis 6), so any sent message reaches its destination and the reception of the result is guaranteed. If the requested agent is incapable of responding to the service request due to its state (failed, denied, or busy), then the detection is real.

Case 2:

The requesting agent receives a delayed message.

After completing the collaboration process, the agent returns to its normal state. If the agent, at instant t, receives a result for action a from the pre-requested agent and its SDA does not contain this action, then the agent ignores the delayed message. The possible scenario for this case is that the agent requested the pre-requested agent for the provided action a but did not receive the results in a finite amount of time, so the pre-requested agent was added to its SFA (procedure 7). The agent then used either the contextual recovery (procedure 6) or the cloud recovery (procedure 8) and received the result from one of them. In both cases, the agent deleted action a from its SDA.

Theorem 1

Each requested service is available.

Proof

Let a be a requested service. Suppose that a DDoS attack is detected by the protocol; according to Proposition 2, the detection is real. Furthermore, the requested agent Ag satisfies \(a \in SDA_{Ag}\), which implies that there is an agent \(Ag^{'}\) with \(a \in SDA_{Ag^{'}}\) whose shadow \(CAg^{'}\) is available to accomplish the requested service; according to Proposition 1, the denied service will be recovered. □

5.14 An illustrative example

Through the following example, we illustrate the main idea of our protocol using the space-time diagram of a distributed system that contains four devices (\(Ag_i\), \(Ag_j\), \(Ag_k\), \(Ag_l\)). Two scenarios will be developed. The first one explains the normal situation, the discovery, and the collaboration process. The second scenario concerns the healing from the abnormal situation using the contextual healing or Cloud healing in the presence of potential DDoS attacks.

5.14.1 Discovery and collaboration phases

In the discovery routine, the agent \(Ag_i\) broadcasts the discovery messages to detect the possible agents present in its context (\(Ag_j\), \(Ag_k\) and \(Ag_l\)). Figure 12 shows the set SCA of \(Ag_i\) that invokes this routine and the responses containing the agent identity and their available actions (SDA). Obviously, both \(Ag_j\) and \(Ag_k\) provide action \(a_1\), and \(Ag_l\) provides action \(a_2\). Consequently, the set SCA of \(Ag_i\) is updated as follows: \(SCA = \{ Ag_j, \, Ag_l, \, Ag_k \}\).

Fig. 12 A space-time diagram for discovering behavior

Once the discovery routine is executed, the agent can collaborate at any time with the set of agents in its context by sending a request for actions to the selected agent from its SCA. The space-time diagram in Fig. 13 shows the normal situation of requesting and responding between agents (\(Ag_i\), \(Ag_j\), and \(Ag_l\)), where the set of available actions will be empty after receiving the results for each requested action.

Fig. 13 A space-time diagram for normal-collaboration situation

5.14.2 Recovery phase

In the case of detecting abnormal behavior in the requested agent, the protocol searches the requester's SCA for a substitute contextual agent that provides the same action \(a_1\) (here, \(Ag_k\)) and sends it a request to execute this action. The contextual-healing solution is used as long as the SCA of the requester agent is not empty. Figure 14 illustrates the contextual healing mechanism of the proposed protocol.

Fig. 14 A space-time diagram for contextual recovery from an abnormal-collaboration situation

On the other hand, if, after applying the contextual healing, the SCA of the requester agent is empty and its SFA is not, the protocol relocates the required action (\(a_1\)) to the cloud platform by sending a diagnostic message to its shadow (\(CAg_i\)) and waiting for the result from the shadow agent (\(CAg_j\)) of the denied agent (see Fig. 15). Consequently, the agent deletes action \(a_1\) from its SDA upon receiving the result from the shadow agent (\(CAg_j\)). The shadow agent (\(CAg_j\)) is responsible for diagnosing the state of its physical agent (\(Ag_j\)) and selecting an optimal plan to protect it from the DDoS attack.

Fig. 15 A space-time diagram for Cloud recovery from an abnormal-collaboration situation

6 Performance evaluations

This section presents the performance evaluation of the CIoTAS protocol. The evaluation is based on the number of exchanged-control messages in the network and the time of response for the requested services.

6.1 Message-based evaluation

6.1.1 Discovery phase

Let us assume that, for the discovery phase, an agent sends at least N messages (where \(N \ge 1\)) in order to discover its context. In the best case, the agent receives N responses after sending the discovery messages; in the worst case, it does not receive any response. The number of messages in the discovery phase is therefore:

$$\begin{aligned} N \le nbr \le 2*N \end{aligned}$$
(1)

In this expression, nbr is the number of control messages in the network.

6.1.2 Collaboration phase

Two cases should be distinguished:

  • The request is achieved by a contextual agent: Based on the discovery-phase output, the agent selects a provider agent to which it sends a request for a given action a. In the best case, the agent sends one request message to the selected agent that may execute action a and then waits for its result. In the worst case, the agent receives no result from any agent in SCA except the last requested one; assuming that all the discovered agents can execute action a and that the cardinality of SCA is \(\vert SCA \vert = n\), the maximum number of request messages is n. The number of messages for the request phase using only the contextual healing is therefore:

    $$\begin{aligned} 2 \le nbr \le n+1 \end{aligned}$$
    (2)

    In this expression, n is the number of request messages in the network and 1 corresponds to the received normal result. The previous inequality is easily generalized to the case where the agent requests L actions (\(1 \le L \le \vert SDA \vert \)) as follows:

    $$\begin{aligned} 2*L \le nbr \le L*(n+1) \end{aligned}$$
    (3)
  • The request is achieved by a shadow agent at the Cloud: This case concerns the situation in which the agent does not find a contextual agent to execute the requested action(s). Hence, for each requested action, it has already spent n contextual control messages (request messages). Since the Cloud healing needs two messages, the number of control messages in the network is n+2. However, if the agent pre-selects an agent from its SFA, the number of Cloud-healing messages is seven, as explained in Sect. 7.1.2. In the general case where the agent requests more than one action (L actions), the number of control messages is:

    $$\begin{aligned} L*(n+2) \le nbr \le L*(n+7) \end{aligned}$$
    (4)
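
The message bounds of Eqs. 1 to 4 can be evaluated with a short Python sketch. The function names are hypothetical; the formulas are transcribed directly from the inequalities above.

```python
def discovery_message_bounds(N):
    """Eq. 1: N <= nbr <= 2*N."""
    return (N, 2 * N)


def contextual_message_bounds(n, L=1):
    """Eqs. 2 and 3: 2*L <= nbr <= L*(n+1), with n = |SCA|."""
    return (2 * L, L * (n + 1))


def cloud_message_bounds(n, L=1):
    """Eq. 4: L*(n+2) <= nbr <= L*(n+7)."""
    return (L * (n + 2), L * (n + 7))
```

For instance, with n = 20 discovered agents and a single requested action, the contextual case gives between 2 and 21 messages, and the Cloud case between 22 and 27 messages.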

6.2 Time to response based evaluation

6.2.1 Discovery phase

So far, we defined the number of messages in the network according to each phase. Hence, in the following, we evaluate the amount of time \(\tau \) for each phase to achieve the required tasks.

Theoretically speaking, if we suppose that the transmission time of a message between two agents is T under the pre-defined hypotheses, the time of the discovery phase is at least \(2*T\) and, in the worst case, in which the agent does not receive any response (context changing), is unbounded:

$$\begin{aligned} 2*T \le \tau \le \infty \end{aligned}$$
(5)

6.2.2 Collaboration phase

Two cases should be distinguished:

  • The request is achieved by a contextual agent: The time to response for the collaboration phase lies between the minimum contextual response time, reached in a normal situation when the first solicited contextual agent responds, and the maximum contextual response time, reached in the abnormal situation when only the last solicited contextual agent responds (the last agent in the requesting agent's SCA). The time to response is evaluated as follows:

    $$\begin{aligned} 2*T*L \le \tau < L*(n*(T+\Delta )+T) \end{aligned}$$
    (6)
  • The request is achieved by a shadow agent from the Cloud: When the Cloud solution is used, we distinguish two cases. In the first case, the agent has changed its context without updating its SCA, and its SFA may be NULL; the time to heal is then that of inequality 6 plus the \(2*T\) needed for sending the diagnostic message (with a NULL agent parameter) and receiving the results. In the second case, the requesting agent selects an agent from its SFA to be diagnosed and repaired; the time to heal from the poor-collaboration phase is then that of inequality 6 plus the time spent on the diagnostic messages:

$$\begin{aligned} L*(n*(T+\Delta )+3*T)\le \tau < L*(n*(T+\Delta )+8*T) \end{aligned}$$
(7)

where the number of messages exchanged in the abnormal situation is equal to 7, as explained in Sect. 7.1.2.
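
The time bounds of inequalities 6 and 7 can be evaluated in the same way. The following Python sketch uses hypothetical function names and transcribes the two inequalities directly; the numeric values in the usage note are illustrative only.

```python
def contextual_time_bounds(T, delta, n, L=1):
    """Inequality 6: 2*T*L <= tau < L*(n*(T+delta)+T)."""
    return (2 * T * L, L * (n * (T + delta) + T))


def cloud_time_bounds(T, delta, n, L=1):
    """Inequality 7: L*(n*(T+delta)+3*T) <= tau < L*(n*(T+delta)+8*T)."""
    base = n * (T + delta)
    return (L * (base + 3 * T), L * (base + 8 * T))
```

Taking, as an assumed example, T = 1, \(\Delta = 5\), n = 3, and L = 1, the contextual case gives \(2 \le \tau < 19\) and the Cloud case gives \(21 \le \tau < 26\).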

7 Implementation and experimental results

In this section, we present the implementation of the CIoTAS protocol. Additionally, we select an IoT scenario to investigate its robustness; the behavior types and the exchanged control messages are presented as well.

7.1 Implementation

The proposed protocol was implemented using the JADE platform (https://jade.tilab.com/) version 4.5 in a Java environment with JDK version 7; JADE was selected because it supports asynchronous message-based communication between agents. Our experiments used a 2nd-generation Intel i3 processor with 2 cores and 4 threads, a 2.2 GHz clock speed, and 4 GB of RAM, running a 64-bit Windows 7 OS.

Before illustrating the chosen application scenario, we first define the life cycle of each agent (or shadow agent), which contains the different reaction types to the defined events (see Sect. 5.11.1) under the healthy and unhealthy system states according to the JADE platform. Note that the agent and the shadow agent are implemented in two different containers.

In general, the protocol events are either message sending/receiving events or routines. Routines, such as discovery, are ticker behaviors launched periodically, with a period defined by the application developer. Sending events, such as sending a diagnostic or a result message, are one-shot behaviors, whereas the request event is a generic behavior whose done function depends on the SDA elements. Receiving events are cyclic behaviors that wait for responses to the requested tasks.

In the following, we present the syntax of the different exchanging control messages under normal and abnormal situations.

7.1.1 Normal situations

The formal notations of the control messages in normal situations are:

  • (Discover, Agent)

  • (Response, Set of Actions, Agent)

  • (Request, Action)

  • (Result, Action, Action-Results, Agent)

7.1.2 Abnormal situations

Note that the control messages exchanged in abnormal situations using contextual healing are the last two messages above, Request and Result. The formal notations of the control messages using Cloud healing are:

  • (Get-Diagnostic, Action, Agent)

  • (Signature, Agent)

  • (Get-Shadow-State, Action, Agent-shadow)

  • (Get-State, Action, Agent)

  • (Result, Action, Action-Results, Agent-shadow)

  • (Shadow-Diagnostic, State, Agent)

  • (Diagnostic-Result, Agents, Agents)
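
A few of these message formats can be written as typed records, as in the following Python sketch. The class and field names are hypothetical renderings of the notation above, agents are assumed to be identified by strings, and `None` is assumed to model a NULL field.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Optional


@dataclass(frozen=True)
class Discover:
    agent: str


@dataclass(frozen=True)
class Response:
    actions: FrozenSet[str]
    agent: str


@dataclass(frozen=True)
class Request:
    action: str


@dataclass(frozen=True)
class Result:
    action: str
    action_results: Any   # None models a NULL result
    agent: str            # a physical agent or a shadow agent


@dataclass(frozen=True)
class GetDiagnostic:
    action: str
    agent: Optional[str]  # None when the failed agent is unknown


def is_null_result(msg):
    """True for a Result message carrying NULL action results."""
    return isinstance(msg, Result) and msg.action_results is None
```

Frozen dataclasses are chosen here only so that messages are immutable and comparable by value; the remaining Cloud-healing messages could be declared in the same way.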

7.2 Smart-building application

The smart-building application offers a rich environment to implement many scenarios. In the smart-building application, the CIoTAS protocol could be applied in several collaboration tasks (actions), such as calculation tasks, data collection, or controlling actions (Akram et al. 2016). Note that, in such an application, the IoT sensors/devices are embedded in the building, provided to the end-users under specific privileges, and used in the following cases:

  1. Secure access control for end-users to building sections (e.g., rooms).

  2. Personal data collection: The end-users can collect data for an individual task directly, without an intermediate element (cloud platform or server). In this case, the data could be metering, energy consumption, and HVAC (Heating, Ventilation, and Air Conditioning) data.

  3. Smart space flow analytics: Guiding the users in the required direction and tracking them along the way; indoor navigation is an example.

  4. Open source and reservation: The users are informed about the services provided in the building, as required.

  5. Lighting and energy controlling: The users can control the lighting model according to their needs to minimize building energy consumption.

7.3 Experimental scenario

In a smart building, each person can benefit from the available services of the building using only their own device (smartphone). The CIoTAS protocol could be downloaded as an assisting application that scans the provided services. Here, the protocol is applied in an indoor navigation application that guides the users to their destination based on their location (step calculator) in the building (lab) structure, situated at the University of Constantine 2 – Abdelhamid Mehri, Constantine City, Algeria (see Fig. 16).

Fig. 16 University of Constantine 2 – Abdelhamid Mehri, Constantine City, Algeria location

The lab includes many sections and groups different faculties (computer science, economics, etc.). The structure of the lab is divided into three blocks, A, B, and C, from right to left as shown in Fig. 16. It has two entrances, from blocks B and C; each block contains several floors, and each floor contains many rooms named 1A1, 1A2, 1A3, ..., according to the floor and the sequence of rooms. The MISC lab is situated on the left side of the building on the second floor of block A, rooms 1A4 and 1A5.

The users use the protocol integrated in the app, as a service, to navigate and to access the available lab services. Note that the embedded IoT devices and the mobile ones are associated with Cartesian coordinates (x, y) in the grid of the building scope to capture each movement. The protocol thus provides the users with the possible paths to reach their destination and calculates their steps.

Fig. 17 Smart Building (University Abdelhamid Mehri Constantine 2 Lab) Application

Thus, the requested service is the calculation of the distance between the initial position (\(P(x_1, y_1)\)), which is randomly generated within the lab's range, and sub-fixed positions (\(Q(x_2, y_2)\)) according to the user position, followed by the calculation of the user's steps until the requested destination is reached.
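
A minimal Python sketch of such a service is given below. It assumes a Euclidean distance on the Cartesian grid and a fixed stride length; neither the metric nor the stride value is specified in the text, so both are illustrative assumptions, as are the function names.

```python
import math


def distance(p, q):
    """Euclidean distance between P(x1, y1) and Q(x2, y2) (assumed metric)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def steps(p, q, stride=0.75):
    """Number of steps needed to walk from p to q.

    stride is a hypothetical step length, in the same unit as the grid.
    """
    return math.ceil(distance(p, q) / stride)
```

For example, moving from P(0, 0) to Q(3, 4) covers a distance of 5 grid units, which corresponds to 7 steps with the assumed stride of 0.75.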

7.4 Experimental results

In this section, we present the experiments to investigate the robustness of CIoTAS protocol according to the previous scenario of indoor navigation in the smart building. We assume that the users are moving geographically (Assumption 3), in a virtual-building environment range, and can request at any time the embedded IoT devices, in the building, for guiding. Figure 17 presents the path of one user that asks for the MISC lab destination (captured in a mini instant).

Under this request, we also assume that the IoT devices could be fixed on each floor (e.g., sensors and actuators) or moving, like a human agent carrying a smartphone. Numerically speaking, we measured the number of exchanged control messages for 20 IoT devices and three users in each experiment. Note that the collaboration process is launched after 5 s. We suppose that the discovery phase is launched for each experiment, that 3 random IoT devices around each user are denied, busy, or failed after a random time, and that the time to respond (\(\Delta \)) is 5 s. The experimental results in the normal and abnormal situations are stored in files for plotting.

Figure 18 illustrates the experimental results in normal situations for one experiment, in which the number of control messages in the network is 108. To justify this number, we explain in detail how the number of control messages changes over time. The SCAs of the first and second user apps contain 20 devices after the discovery phase, and both users' requests are delivered normally (each of these apps exchanges 42 control messages). The SCA of the last user app contains 11 devices, and it collaborates normally (it exchanges 24 control messages). These values verify the performance-evaluation Eq. 1 (e.g., \(20 \le 40 \le 40\) for the first user app) and Eq. 2 (e.g., \(2 \le 2 \le 21\) for the first user app). The total time to accomplish the collaboration tasks is around 9 s.
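
One decomposition consistent with these totals can be checked with a few lines of Python. It assumes that each discovered device accounts for one Discover and one Response message and that a normal collaboration adds one Request and one Result; this breakdown is an interpretation of the reported figures, and the function name is hypothetical.

```python
def normal_case_messages(n_devices):
    """Messages exchanged by one user app in the normal case (assumed split):
    n Discover + n Response + 1 Request + 1 Result."""
    return 2 * n_devices + 2


# SCA sizes reported for the three user apps: 20, 20, and 11 devices.
per_user = [normal_case_messages(n) for n in (20, 20, 11)]
total = sum(per_user)
```

Under this assumption, the three apps exchange 42, 42, and 24 messages, which sum to the 108 messages reported for the experiment.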

The first two user apps discover all the existing devices, which happen to be within their range in this experiment; in fact, each of these apps sends 20 discovery messages, whereas the third user app discovers only 12 devices. In terms of elapsed time (s), the network contains 44 messages at \(t=0\) s: 20 messages sent from the first user app, 20 from the second user app, and 4 from the third user app.

In response, the building devices reply to each user app as follows: the first two users each receive 20 responses from the devices, and the third one receives 12 messages. Specifically, at instant \(t=2\) s, the first user app receives 9 messages, as does the second; at instant \(t=3\) s, the third user starts receiving response messages (1 message). At instant \(t=4\) s, the first two users receive 10 messages in total (5 messages each). Finally, at instant \(t=5\) s, the discovery phase is completed for all the users’ apps: each of the first two users receives its last 6 messages, and the third user receives 11 messages.

Fig. 18 Number of messages in the network for normal condition

In the collaborative phase, every user selects a pre-discovered agent, stored from the previous events, and sends it a request; this agent happens not to be denied, so it responds to the request. The first user engages immediately by sending a request at instant \(t=5\) s, whereas the last two user apps send their requests at instant \(t=6\) s. Thus, the number of request messages in the network is 3. Consequently, all the users’ apps receive the results at instant \(t=9\) s (2 s for establishing the results).

Figure 19 presents another experimental result, obtained in the abnormal situation for one experiment. We suppose that three agents around each user return NULL results, which reflects the busy state in which the provider denies the request. In this case, the number of messages in the network is 52: 14 messages for the first user app, 12 for the second user app, and 26 for the last one. The total time to accomplish the collaboration tasks is 6 s. The contextual healing is launched after 5.5 s for the first user, the second user receives the results after 5.58 s, and the last agent attempts the selection more than three times before receiving the results after 5.608 s.

Fig. 19 Number of messages in the network for abnormal condition using the contextual healing

As we can observe, the discovery phase is launched and accomplished for all user apps at the same instant \(t=0\) s. The first user app discovers 5 contextual agents, the second app discovers 5 contextual agents (\(SCA=5\)), and the third app discovers 9 contextual agents (\(SCA=9\)); these values verify the performance-evaluation Eq. 1 (e.g., \(9 \le 18 \le 18\) for the third user app) and Eq. 3 (e.g., \(2 \le 4 \le 10\) for the third user app). For the request part, the first user app sends three request messages and receives the results from the last provider (the first two requests are denied). The second user app sends one request and receives the results; in other words, the selected provider agent operates in a normal state. The last user app sends its request to the service providers more than three times: the first three provider agents deny the request, but the fourth provider accepts it and responds by establishing the results and sending them (the sent results are not equal to NULL).
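The request behavior described above can be sketched as a simple retry loop over the discovered providers. This is a minimal illustration under stated assumptions, not the CIoTAS implementation: providers are modeled as plain Python callables, a denied or busy provider returns `None` (standing in for NULL), and the names `request_with_healing`, `denied`, and `available` are hypothetical.

```python
def request_with_healing(providers, service_input):
    """Ask each discovered provider in turn until one returns a non-None result.

    Returns a pair (result, requests_sent). A None result from a provider
    models a denied or busy agent (the NULL result in the text).
    """
    requests_sent = 0
    for provider in providers:
        requests_sent += 1
        result = provider(service_input)
        if result is not None:
            return result, requests_sent
    # Every contextual provider denied the request.
    return None, requests_sent


def denied(_service_input):
    """Hypothetical provider that is busy and denies every request."""
    return None


def available(service_input):
    """Hypothetical provider that answers; doubling is a placeholder service."""
    return service_input * 2


# Third user app in the abnormal experiment: three denials, then an answer.
result, requests_sent = request_with_healing(
    [denied, denied, denied, available], 21
)
print(result, requests_sent)  # 42 4
```

With the provider lists `[denied, denied, available]` and `[available]`, the same loop sends three requests and one request, which matches the behavior reported for the first and second user apps.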

The above experimental examples may be taken as a reference for future studies to improve the performance of the proposed protocol under suitable usage of energy in the IoT devices.

8 Discussion

In this section, we compare the related works presented earlier (see Sect. 4), both those based on the IoT system and those based on the CloudIoT paradigm, with the CIoTAS protocol. In addition, we summarize both comparisons in tables.

As said above, we start by comparing the selected approaches according to the used methods, the processing nature, the selected system-level and impairment sub-level, the monitoring element, the healing and the protection strategies, and the QoS factors.

Note that the selected method depends on the nature of the process in which the diagnostic unit is placed (which could be followed by the other self-healing attributes). The issue in the work proposed by Kühn et al. (2018) is the placement of the monitoring component that manages the system state. This placement may influence either the detection process in the IoT system at run-time, especially in the case of temporarily unavailable nodes, or the diagnosis attribute that defines the recovery plan, such as event-condition-action.

Table 2 summarizes the selected related works on IoT-based systems. It can easily be seen from Table 2 that the detection attribute is the basic element of all the related works presented above that use the Centralized IoT approach to protect IoT devices against DoS attacks. For example, the proposed SWARD (Montoya et al. 2018) mitigates token attacks in the network (a precautionary action) to maximize the confidentiality of the IoT device in a Centralized IoT. However, a real-time reaction to an attack, which is needed to deal with device availability issues, is absent.

In the same perspective, the detection-based contribution presented by Carl et al. (2006) is used to enhance the level of security in the IoT system. In contrast, the contribution of WISeKey (2016) deals with DoS attacks after detecting them, using a fixed action such as an update or a password modification as a preventive task for the future use of devices in the IoT system. Besides, the organization of the diagnostic element in Sharma et al. (2017) provides a rich framework for sharing diagnostic results in IoT systems, in which the attacker’s profile is analyzed as a pre-processing step.

In contrast, the CIoTAS protocol is implemented in the IoT device and executed in a distributed manner, which preserves the properties of the highly distributed IoT system in the collaboration process in the presence of autonomic behaviors. Consequently, the system reacts to both failures and DDoS attacks by using a substitute device in the same context or its shadow (replica) in the Cloud, which maximizes the availability of the IoT system and minimizes the cost of the healing process. It is important to mention that the CIoTAS protocol applies to the healing of calculable, data collection, and control services. In the case of mechanical services, the system may follow the same concepts as fault-tolerant systems to deal with abnormal situations.
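The two recovery sources mentioned above, a substitute device in the same context and the shadow in the Cloud, can be illustrated with a small sketch. It assumes that contextual substitutes are tried first and that the Cloud shadow serves as the backup; the `heal` function, the `Service` type, and the returned labels are hypothetical names introduced only for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

# A service takes a numeric payload and returns a result, or None when the
# device has failed or denies the request (assumed convention).
Service = Callable[[float], Optional[float]]


def heal(
    substitutes: Sequence[Service], cloud_shadow: Service, payload: float
) -> Tuple[str, Optional[float]]:
    """Return (source, result), preferring contextual substitutes over the Cloud shadow."""
    for index, substitute in enumerate(substitutes):
        result = substitute(payload)
        if result is not None:
            return f"substitute-{index}", result
    # No device in the same context could serve the request: use the replica.
    return "cloud-shadow", cloud_shadow(payload)


def failed(_payload: float) -> Optional[float]:
    """Hypothetical failed or attacked device."""
    return None


def healthy(payload: float) -> Optional[float]:
    """Hypothetical working device; adding 1 is a placeholder service."""
    return payload + 1.0


print(heal([failed, healthy], healthy, 1.0))
print(heal([failed, failed], healthy, 1.0))
```

Trying the contextual substitutes first keeps the recovery local, which is in line with the stated goal of minimizing the cost of the healing process.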

Similarly, Table 3 summarizes the main features of the methods proposed in the related works in the presence of the Cloud, together with those of our solution. Notably, the presented contributions operate without Cloud-IoT platforms to manage the Cloud side of the IoT system.

In Bajunaid (2015), the Cloud-Orbit founder has not yet invested in an over-the-air (OTA) update process to maximize the security of IoT devices. On the other hand, in Yaseen et al. (2018), attack detection is performed in a WSN system; such systems are highly homogeneous, which simplifies the detection process. Moreover, the absence of an intersection element (device) between the IoT islands makes it impossible to apply the migration process in the contribution of Kim et al. (2017). Besides, the approach proposed by Kühn et al. (2018) focuses on the same element, the IoT hub (gateway), as a central unit based on connected intranets of things.

The absence of Cloud platforms in the previous literature limits these approaches to device-controlling solutions rather than device-managing ones. Thus, the CIoTAS protocol requires the existence of such a platform to integrate the \(S^2aaS\) model (shadow element) as a backup solution. Table 4 summarizes the comparison of the CIoTAS protocol with the IoT Cloud platforms selected in Sect. 2.2.4.

9 Conclusion and perspectives

In this paper, a distributed diagnosable protocol based on autonomic computing for the hybrid-IoT system, named the CIoTAS protocol (for Cloud IoT Available Services), has been proposed. It operates under the collaborative process to manage the unpredictable behavior of IoT devices (things) in the case of potential DDoS attacks or device failure and to ensure the availability of their services during real-time execution.

The proposed protocol integrates multiple self-management abilities: the self-discovery ability, which creates a contextual perspective, and the self-healing ability, which uses matchmaking to select a substitute contextual agent that can execute the required action as a contextual-recovery plan. Besides, it uses the advantages of the Cloud as a backup for the recovery plan, and it integrates self-protection abilities to deal with abnormal behaviors in IoT devices.

In order to assess the efficiency and robustness of the proposed approach, the self-healing control loop in collaboration situations and the self-protection loop in a single device have been proposed. Besides, the convergence and closure properties have been adapted to the context of the CIoTAS protocol and their correctness has been proved; additionally, Theorem 1, which assesses the correctness of the protocol, has been proved. Our comparison shows that the proposed diagnosable protocol is the first one designed for the hybrid-IoT system, based on Service-Oriented Architecture (SOA) for IoT middleware in the things-oriented vision, to deal transparently with the service availability problem.

Future research could therefore analyze the effects of insider attacks against the CIoTAS protocol and the different threats related to DDoS attacks, as well as enhance the performance of the proposed protocol, particularly its energy consumption, so that it can be suitably implemented on IoT devices. As a research perspective, we also intend to integrate the Cloud-healing solution of the proposed protocol into the OpenIoT platform, which supports the \(S^2aaS\) model. In addition, we aim to model the protocol, provided with an agent-execution plan, using high-level specification models such as \(Ag-LOTOS\) and \(Time-Ag-LOTOS\) (Chaouche et al. 2016; Boukharrou et al. 2015).

Table 2 Comparison of the IoT-based approaches
Table 3 Comparison of the approaches based on Cloud technology
Table 4 Comparison of the approaches based on Cloud IoT platforms